grungrun
grungrun

Reputation: 13

How to merge two dataframes by similar (but not matching) values?

I have two dataframes.

player_stats:

    player          minutes  total_points  assists  
1   Erling Haaland  77       13            0    
2   Kevin De Bruyne 90       6             1

and, season_gw1:

        player          position  gw  team  
10449   Erling Håland   4         1   Manchester City
10453   Kevin De Bruyne 3         1   Manchester City

I want to merge these two dataframes by player, but as you can see, for the first player (Haaland), the word is not spelled exactly the same on both dfs.

This is the code I'm using:

season_gw1_stats = season_gw1.merge(player_stats, on = 'player')

And the resulting df (season_gw1_stats) is this one:

        player            position  gw  team             minutes  total_points  assists
10453   Kevin de Bruyne   3         1   Manchester City  90       6             1

How do I merge dfs by similar values? (This is not the full dataframe - I also have some other names that are spelled differently in both dfs but are very similar).

Upvotes: 0

Views: 657

Answers (1)

J_H
J_H

Reputation: 20505

In order to use standard pandas to "merge dataframes" you will pretty much need to eliminate "similar" from the problem statement. So we're faced with mapping to "matching" values that are identical across dataframes.

Here's a pair of plausible approaches.

  1. normalize each name in isolation
  2. examine quadratic pairwise distances

1. normalize

Map variant spellings down to a smaller universe of spellings where collisions (matches) are more likely. There are many approaches:

  • case smash to lower
  • map accented vowels to [aeiou]
  • discard all vowels
  • use simplifying regexes like s/sch/sh/ and s/de /de/
  • use Soundex or later competitors like Metaphone
  • manually curate a restricted vocabulary of correct spellings

Cost is O(N) linear with total length of dataframes.

2. pairwise distances

We wish to canonicalize, to boil down multiple variant spellings to a distinguished canonical spelling. Begin by optionally normalizing, then sort, at cost of O(N log N), and finally make a linear pass that outputs only unique names. This trivial pre-processing step reduces N, which helps a lot when dealing with O(N^2) quadratic cost.

Define a distance metric which accepts two names. When given a pair of identical names it must report a distance of zero. Otherwise it deterministically reports a positive real number. You might use Levenshtein, or MRA.

Use nested loops to compare all names against all names. If distance between names is less than threshold, arbitrarily declare name1 the winner, overwriting 2nd name with that 1st value. The effect is to cluster multiple variant spellings down to a single winning spelling.

Cost is O(N^2) quadratic.

Perhaps you're willing to tinker with the distance function a bit. You might give initial letter(s) a heavier weight, such that mismatched prefix guarantees the distance shall exceed threshold. In that case sorted names will help out, and the nested loop can be confined to just a small window of similarly prefixed names, with early termination once it sees the prefix has changed. Noting the distance between adjacent sorted names can help with manually choosing a sensible threshold parameter.


Finally, with adjusted names in hand, you're in a position to .merge() using exact equality tests.

Upvotes: 1

Related Questions