Reputation: 13
I have two dataframes.
player_stats:
player minutes total_points assists
1 Erling Haaland 77 13 0
2 Kevin De Bruyne 90 6 1
and, season_gw1:
player position gw team
10449 Erling Håland 4 1 Manchester City
10453 Kevin De Bruyne 3 1 Manchester City
I want to merge these two dataframes by player, but as you can see, for the first player (Haaland), the word is not spelled exactly the same on both dfs.
This is the code I'm using:
season_gw1_stats = season_gw1.merge(player_stats, on = 'player')
And the resulting df (season_gw1_stats) is this one:
player position gw team minutes total_points assists
10453 Kevin de Bruyne 3 1 Manchester City 90 6 1
How do I merge dfs by similar values? (This is not the full dataframe - I also have some other names that are spelled differently in both dfs but are very similar).
Upvotes: 0
Views: 657
Reputation: 20505
In order to use standard pandas to "merge dataframes" you will pretty much need to eliminate "similar" from the problem statement. So we're faced with mapping to "matching" values that are identical across dataframes.
Here's a pair of plausible approaches.
Map variant spellings down to a smaller universe of spellings where collisions (matches) are more likely. There are many approaches:
s/sch/sh/
and s/de /de/
Cost is O(N) linear with total length of dataframes.
We wish to canonicalize, to boil down multiple variant spellings to a distinguished canonical spelling. Begin by optionally normalizing, then sort, at cost of O(N log N), and finally make a linear pass that outputs only unique names. This trivial pre-processing step reduces N, which helps a lot when dealing with O(N^2) quadratic cost.
Define a distance metric which accepts two names. When given a pair of identical names it must report a distance of zero. Otherwise it deterministically reports a positive real number. You might use Levenshtein, or MRA.
Use nested loops to compare all names against all names. If distance between names is less than threshold, arbitrarily declare name1 the winner, overwriting 2nd name with that 1st value. The effect is to cluster multiple variant spellings down to a single winning spelling.
Cost is O(N^2) quadratic.
Perhaps you're willing to tinker with the distance function a bit. You might give initial letter(s) a heavier weight, such that mismatched prefix guarantees the distance shall exceed threshold. In that case sorted names will help out, and the nested loop can be confined to just a small window of similarly prefixed names, with early termination once it sees the prefix has changed. Noting the distance between adjacent sorted names can help with manually choosing a sensible threshold parameter.
Finally, with adjusted names in hand, you're in a position to .merge() using exact equality tests.
Upvotes: 1