Concatenate, merge and join in Pandas

In this post I will show the different ways to join dataframes when we have data in several of them.

Pandas offers us three different types of functions to unite dataframes according to the functionality we want to obtain. Moreover, they are very similar to the database union such as SQL Server.

The most used functions in Pandas are:

  • Concatenate
  • Merge
  • Join

To understand the results of each one, we are going to create dataframes with some information about students and universities.

df_people_1 = pd.DataFrame([('Peter', 24, 'Male', 3),
             ('Phillip', 29, 'Male', 1)], None, ['Name', 'Age', 'Gender', 'University'])

df_people_2 = pd.DataFrame([('Mabel', 18, 'Female', 1),
             ('Bill', 25, 'Male', 2),
             ('Corinne', 22, 'Female', 3)], None, ['Name', 'Age', 'Gender', 'University'])

df_university = pd.DataFrame([(1,'Harvard University'),
             (2, 'Oxford University'),
             (3, 'Stanford University')], None, ['Id','University'])
Name Age Gender University
Peter 24 Male 3
Philip 29 Male 1
People Dataframe (1)
Name Age Gender University
Mabel 18 Female 1
Bill 25 Male 2
Corinne 22 Female 3
People Dataframe (2)
Id University
1 Harvard University
2 Oxford University
3 Stanford University
Universities Dataframe

Concatenate in Pandas

Using concatenate, you can unite two, or more, dataframes without considering their columns this is because is the simplest function.
To concatenate, you need to add all the dataframes in a list and use the concat() function, being able to expand the X or Y axis.

pd.concat([df_people_1, df_people_2, df_university])

Normally, concat() is used to unite dataframes that have the same format and that have been collected from different sources. However, to unite dataframes with different formats but which have some field in common, merge is the best option.

Merge in Pandas

Merge allows us to unite different dataframes having as relation columns or rows. The relationship you want between the dataframes is assigned in the 'how' parameter and the column of each dataframe that looks for that relationship is passed in the 'left_on' and 'right_on' parameter (or in 'on' if it is called the same).

Inner Join

Inner Join

Inner Join
df = pd.merge(df_people_1, df_university, how='inner', left_on='University', right_on='Id')
Name Age Gender University_x Id University_y
Peter 24 Male 3 3 Stanford University
Philip 29 Male 1 1 Harvard University

Left Join

Left Join

Left Join
df = pd.merge(df_people_1, df_university, how='left', left_on='University', right_on='Id')
Name Age Gender University_x Id University_y
Peter 24 Male 3 3 Stanford University
Philip 29 Male 1 1 Harvard University

Right Join

Right Join

Right Join
df = pd.merge(df_people_1, df_university, how='right', left_on='University', right_on='Id')
Name Age Gender University_x Id University_y
Peter 24 Male 3 3 Stanford University
Philip 29 Male 1 1 Harvard University
- - - - 2 Oxford Univ

Outer Join

The operation of outer is the same as the concat in the X axis or just without collapse rows.

Outer Join

Outer Join
df = pd.merge(df_people_1, df_university, how='outer', left_on='University', right_on='Id')
Name Age Gender University_x Id University_y
Peter 24 Male 3 3 Stanford University
Philip 29 Male 1 1 Harvard University
- - - - 2 Oxford University

Join in Pandas

The operation of the join function is very similar to the merge function and the same relations can be made with both. But, some of the most relevant differences are;

  • Merge does by default an inner join and join does by default a left join.
  • Merge can join one or more columns of the second dataframe but join always joins via the index of the second dataframe.

An example of an outer join using this fuction is:

df_people_1.join(df_university, how='outer', lsuffix='_left', rsuffix='_right')
Name Age Gender University_x Id University_y
Peter 24 Male 3 3 Stanford University
Philip 29 Male 1 1 Harvard University
- - - - 2 Oxford University

In a few words

As we saw in the previous post about Pandas, this package allows us to manipulate data easily. And, to make an exploratory analysis, joining several dataframes that have data in common can help us enormously. For this reason, it is important to have a domain and understand what each type of join does because sometimes we may need to relate six, seven, or even more dataframes.


Your subscription could not be saved. Please try again.
Your subscription has been successful. Thank you for joining this great data world.

GET OUR NEWSLETTER

You'll get the latest posts delivered to your inbox.