antonb
antonb

Reputation: 41

How to merge two datasets by specific column in pandas

I'm playing around with the Kaggle dataset "European Soccer Database" and want to combine it with another FIFA18-dataset.

My problem is the name-column in these two datasets are using different format.

For example: "lionel messi" in one dataset and in the other it is "L. Messi"

I would to convert the "L. Messi" to the lowercase version "lionel messi" for all rows in dataset.

What would be the most intelligent way to go about this?

Upvotes: 0

Views: 689

Answers (1)

ASGM
ASGM

Reputation: 11391

One simple way is to convert the names in both dataframes into a common format so they can be matched.* Let's assume that in df1 names are in the L. Messi format and in df2 names are in the lionel messi format. What would a common format look like? You have several choices, but one option would be all lowercase, with just the first initial followed by a period: l. messi.

df1 = pd.DataFrame({'names': ['L. Messi'], 'x': [1]})
df2 = pd.DataFrame({'names': ['lionel messi'], 'y': [2]})

df1.names = df1.names.str.lower()
df2.names = df2.names.apply(lambda n: n[0] + '.' + n[n.find(' '):])

df = df1.merge(df2, left_on='names', right_on='names')

*Note: This approach is totally dependent on the names being "matchable" in this way. There are plenty of cases that could cause this simple approach to fail. If a team has two members, Abby Wambach and Aaron Wambach, they'll both look like a. wambach. If one dataframe tries to differentiate them by using other initials in their name, like m.a. wambach and a.k. wambach, the naive matching will fail. How you handle this depends on the size your data - maybe you can try match most players this way, and see who gets dropped, and write custom code for them.

Upvotes: 1

Related Questions