Reputation: 41
I'm playing around with the Kaggle dataset "European Soccer Database" and want to combine it with another FIFA18-dataset.
My problem is the name-column in these two datasets are using different format.
For example: "lionel messi" in one dataset and in the other it is "L. Messi"
I would to convert the "L. Messi" to the lowercase version "lionel messi" for all rows in dataset.
What would be the most intelligent way to go about this?
Upvotes: 0
Views: 689
Reputation: 11391
One simple way is to convert the names in both dataframes into a common format so they can be matched.* Let's assume that in df1
names are in the L. Messi
format and in df2
names are in the lionel messi
format. What would a common format look like? You have several choices, but one option would be all lowercase, with just the first initial followed by a period: l. messi
.
df1 = pd.DataFrame({'names': ['L. Messi'], 'x': [1]})
df2 = pd.DataFrame({'names': ['lionel messi'], 'y': [2]})
df1.names = df1.names.str.lower()
df2.names = df2.names.apply(lambda n: n[0] + '.' + n[n.find(' '):])
df = df1.merge(df2, left_on='names', right_on='names')
*Note: This approach is totally dependent on the names being "matchable" in this way. There are plenty of cases that could cause this simple approach to fail. If a team has two members, Abby Wambach
and Aaron Wambach
, they'll both look like a. wambach
. If one dataframe tries to differentiate them by using other initials in their name, like m.a. wambach
and a.k. wambach
, the naive matching will fail. How you handle this depends on the size your data - maybe you can try match most players this way, and see who gets dropped, and write custom code for them.
Upvotes: 1