In this post I will show the different ways to join dataframes when we have data in several of them.
Pandas offers us three different types of functions to unite dataframes according to the functionality we want to obtain. Moreover, they are very similar to the database union such as SQL Server.
The most used functions in Pandas are:
- Concatenate
- Merge
- Join
To understand the results of each one, we are going to create dataframes with some information about students and universities.
df_people_1 = pd.DataFrame([('Peter', 24, 'Male', 3), ('Phillip', 29, 'Male', 1)], None, ['Name', 'Age', 'Gender', 'University'])
df_people_2 = pd.DataFrame([('Mabel', 18, 'Female', 1), ('Bill', 25, 'Male', 2), ('Corinne', 22, 'Female', 3)], None, ['Name', 'Age', 'Gender', 'University'])
df_university = pd.DataFrame([(1,'Harvard University'), (2, 'Oxford University'), (3, 'Stanford University')], None, ['Id','University'])
Name | Age | Gender | University |
Peter | 24 | Male | 3 |
Philip | 29 | Male | 1 |
People Dataframe (1)
Name | Age | Gender | University |
Mabel | 18 | Female | 1 |
Bill | 25 | Male | 2 |
Corinne | 22 | Female | 3 |
People Dataframe (2)
Id | University |
1 | Harvard University |
2 | Oxford University |
3 | Stanford University |
Universities Dataframe
Concatenate in Pandas
Using concatenate, you can unite two, or more, dataframes without considering their columns this is because is the simplest function.
To concatenate, you need to add all the dataframes in a list and use the concat() function, being able to expand the X or Y axis.
pd.concat([df_people_1, df_people_2, df_university])
Normally, concat() is used to unite dataframes that have the same format and that have been collected from different sources. However, to unite dataframes with different formats but which have some field in common, merge is the best option.
Merge in Pandas
Merge allows us to unite different dataframes having as relation columns or rows. The relationship you want between the dataframes is assigned in the 'how' parameter and the column of each dataframe that looks for that relationship is passed in the 'left_on' and 'right_on' parameter (or in 'on' if it is called the same).
Inner Join
Inner Join
df = pd.merge(df_people_1, df_university, how='inner', left_on='University', right_on='Id')
Name | Age | Gender | University_x | Id | University_y |
Peter | 24 | Male | 3 | 3 | Stanford University |
Philip | 29 | Male | 1 | 1 | Harvard University |
Left Join
Left Join
df = pd.merge(df_people_1, df_university, how='left', left_on='University', right_on='Id')
Name | Age | Gender | University_x | Id | University_y |
Peter | 24 | Male | 3 | 3 | Stanford University |
Philip | 29 | Male | 1 | 1 | Harvard University |
Right Join
Right Join
df = pd.merge(df_people_1, df_university, how='right', left_on='University', right_on='Id')
Name | Age | Gender | University_x | Id | University_y |
Peter | 24 | Male | 3 | 3 | Stanford University |
Philip | 29 | Male | 1 | 1 | Harvard University |
- | - | - | - | 2 | Oxford Univ |
Outer Join
The operation of outer is the same as the concat in the X axis or just without collapse rows.
Outer Join
df = pd.merge(df_people_1, df_university, how='outer', left_on='University', right_on='Id')
Name | Age | Gender | University_x | Id | University_y |
Peter | 24 | Male | 3 | 3 | Stanford University |
Philip | 29 | Male | 1 | 1 | Harvard University |
- | - | - | - | 2 | Oxford University |
Join in Pandas
The operation of the join function is very similar to the merge function and the same relations can be made with both. But, some of the most relevant differences are;
- Merge does by default an inner join and join does by default a left join.
- Merge can join one or more columns of the second dataframe but join always joins via the index of the second dataframe.
An example of an outer join using this fuction is:
df_people_1.join(df_university, how='outer', lsuffix='_left', rsuffix='_right')
Name | Age | Gender | University_x | Id | University_y |
Peter | 24 | Male | 3 | 3 | Stanford University |
Philip | 29 | Male | 1 | 1 | Harvard University |
- | - | - | - | 2 | Oxford University |
In a few words
As we saw in the previous post about Pandas, this package allows us to manipulate data easily. And, to make an exploratory analysis, joining several dataframes that have data in common can help us enormously. For this reason, it is important to have a domain and understand what each type of join does because sometimes we may need to relate six, seven, or even more dataframes.