appleLover

Reputation: 15691

Pythonic Name Matching

I have a database with names of football teams (for instance, in the first entry below, Houston and Alabama). Matched up with my database names are some different, yet recognizable, names (in the first entry below, Houst and Alab):

[u'Houston', u'Alabama']
[u'Houst', u'Alab']


[u'Florida State', u'North Carolina State']
[u'NCSt', u'FlaSt']


[u'Penn State', u'Iowa']
[u'PnSt', u'Iowa']


[u'Oklahoma', u'Texas']
[u'Texas', u'Okla']


[u'Florida Atlantic', u'South Florida']
[u'SFla', u'FlAtl']


[u'Georgia', u'Tennessee']
[u'Geo', u'Tenn']


[u'San Jose State', u'Idaho']
[u'UI', u'SJSU']


[u'Washington State', u'Arizona State']
[u'ArzSt', u'WshSt']


[u'Fresno State', u'Nevada']
[u'Nevad', u'FrsSt']


[u'Oregon State', u'Arizona']
[u'ARIZ', u'OSU']


[u'Clemson', u'Virginia Tech']
[u'VTech', u'Clem']


[u'Chattanooga', u'Arkansas']
[u'UTC', u'AR']


[u'USC', u'Stanford']
[u'USC', u'Stanf']


[u'Baylor', u'Colorado']
[u'BU', u'CU']


[u'North Texas', u'Louisiana-Lafayette']
[u'NoTex', u'LaLaf']


[u'Tulane', u'Army']
[u'TLN', u'ARMY']


[u'Troy', u'Florida International']
[u'TROY', u'FIU']


[u'Louisiana-Monroe', u'Arkansas State']
[u'ASU', u'ULM']


[u'Texas Tech', u'Iowa State']
[u'TT', u'ISU']


[u'Akron', u'Western Michigan']
[u'AKRON', u'WMU']


[u'Liberty', u'Toledo']
[u'LIBERTY', u'TOLEDO']


[u'Virginia', u'Middle Tennessee']
[u'Virg', u'MTnSt']


[u'Oklahoma State', u'Texas A&M']
[u'TexAM', u'OKSt']


[u'Notre Dame', u'UCLA']
[u'NDame', u'UCLA']


[u'Rutgers', u'Cincinnati']
[u'Cincy', u'Rutgr']


[u'Ohio State', u'Purdue']
[u'Prdue', u'OhSt']


[u'LSU', u'Florida']
[u'Fla', u'LSU']


[u'Air Force', u'UNLV']
[u'AFA', u'UNLV']


[u'Nebraska', u'Missouri']
[u'Misso', u'Neb']


[u'New Mexico State', u'Boise State']
[u'NMxSt', u'BoiSt']


[u'Pittsburgh', u'Navy']
[u'Navy', u'Pitt']


[u'Wake Forest', u'Florida State']
[u'WFrst', u'FlaSt']


[u'San Jose State', u'Hawaii']
[u'Hawa', u'SJSt']


[u'UCF', u'South Florida']
[u'UCF', u'SFla']

For each group of four names, I need to match each database name to the correct new name. I could do this right now with a lot of if statements, but that would take a lot of code and wouldn't be particularly elegant. Is there a better way to match these?

Upvotes: 0

Views: 404

Answers (3)

eyquem

Reputation: 27575

from difflib import SequenceMatcher

li = [\
([u'Houston', u'Alabama'],
 [u'Houst', u'Alab']),

([u'Florida State', u'North Carolina State'],
 [u'NCSt', u'FlaSt']),

([u'Penn State', u'Iowa'],
 [u'PnSt', u'Iowa']),

([u'Oklahoma', u'Texas'],
 [u'Texas', u'Okla']),

([u'Florida Atlantic', u'South Florida'],
 [u'SFla', u'FlAtl']),

([u'Georgia', u'Tennessee'],
 [u'Geo', u'Tenn']),

([u'San Jose State', u'Idaho'],
 [u'UI', u'SJSU']),

([u'Washington State', u'Arizona State'],
 [u'ArzSt', u'WshSt']),

([u'Fresno State', u'Nevada'],
 [u'Nevad', u'FrsSt']),

([u'Oregon State', u'Arizona'],
 [u'ARIZ', u'OSU']),

([u'Clemson', u'Virginia Tech'],
 [u'VTech', u'Clem']),

([u'Chattanooga', u'Arkansas'],
 [u'UTC', u'AR']),

([u'USC', u'Stanford'],
 [u'USC', u'Stanf']),

([u'Baylor', u'Colorado'],
 [u'BU', u'CU']),

([u'North Texas', u'Louisiana-Lafayette'],
 [u'NoTex', u'LaLaf']),

([u'Tulane', u'Army'],
 [u'TLN', u'ARMY']),

([u'Troy', u'Florida International'],
 [u'TROY', u'FIU']),

([u'Louisiana-Monroe', u'Arkansas State'],
 [u'ASU', u'ULM']),

([u'Texas Tech', u'Iowa State'],
 [u'TT', u'ISU']),

([u'Akron', u'Western Michigan'],
 [u'AKRON', u'WMU']),

([u'Liberty', u'Toledo'],
 [u'LIBERTY', u'TOLEDO']),

([u'Virginia', u'Middle Tennessee'],
 [u'Virg', u'MTnSt']),

([u'Oklahoma State', u'Texas A&M'],
 [u'TexAM', u'OKSt']),

([u'Notre Dame', u'UCLA'],
 [u'NDame', u'UCLA']),

([u'Rutgers', u'Cincinnati'],
 [u'Cincy', u'Rutgr']),

([u'Ohio State', u'Purdue'],
 [u'Prdue', u'OhSt']),

([u'LSU', u'Florida'],
 [u'Fla', u'LSU']),

([u'Air Force', u'UNLV'],
 [u'AFA', u'UNLV']),

([u'Nebraska', u'Missouri'],
 [u'Misso', u'Neb']),

([u'New Mexico State', u'Boise State'],
 [u'NMxSt', u'BoiSt']),

([u'Pittsburgh', u'Navy'],
 [u'Navy', u'Pitt']),

([u'Wake Forest', u'Florida State'],
 [u'WFrst', u'FlaSt']),

([u'San Jose State', u'Hawaii'],
 [u'Hawa', u'SJSt']),

([u'UCF', u'South Florida'],
 [u'UCF', u'SFla']) ]


def comp(N,D,sq = SequenceMatcher(None)):
    sq.set_seqs(N[0],D[0])
    a = sq.ratio()
    sq.set_seqs(N[1],D[1])
    b = sq.ratio()
    
    sq.set_seqs(N[0],D[1])
    x = sq.ratio()
    sq.set_seqs(N[1],D[0])
    y = sq.ratio()

    if a>x and b>y:
        return (N[0],D[0]), (N[1],D[1])
    else:
        return (N[0],D[1]),(N[1],D[0])


print '\n'.join('%-30s   %s' % comp(N,D) for N,D in li)

result

(u'Houston', u'Houst')             (u'Alabama', u'Alab')
(u'Florida State', u'FlaSt')       (u'North Carolina State', u'NCSt')
(u'Penn State', u'PnSt')           (u'Iowa', u'Iowa')
(u'Oklahoma', u'Okla')             (u'Texas', u'Texas')
(u'Florida Atlantic', u'FlAtl')    (u'South Florida', u'SFla')
(u'Georgia', u'Geo')               (u'Tennessee', u'Tenn')
(u'San Jose State', u'SJSU')       (u'Idaho', u'UI')
(u'Washington State', u'WshSt')    (u'Arizona State', u'ArzSt')
(u'Fresno State', u'FrsSt')        (u'Nevada', u'Nevad')
(u'Oregon State', u'OSU')          (u'Arizona', u'ARIZ')
(u'Clemson', u'Clem')              (u'Virginia Tech', u'VTech')
(u'Chattanooga', u'UTC')           (u'Arkansas', u'AR')
(u'USC', u'USC')                   (u'Stanford', u'Stanf')
(u'Baylor', u'BU')                 (u'Colorado', u'CU')
(u'North Texas', u'NoTex')         (u'Louisiana-Lafayette', u'LaLaf')
(u'Tulane', u'TLN')                (u'Army', u'ARMY')
(u'Troy', u'TROY')                 (u'Florida International', u'FIU')
(u'Louisiana-Monroe', u'ULM')      (u'Arkansas State', u'ASU')
(u'Texas Tech', u'TT')             (u'Iowa State', u'ISU')
(u'Akron', u'AKRON')               (u'Western Michigan', u'WMU')
(u'Liberty', u'TOLEDO')            (u'Toledo', u'LIBERTY')
(u'Virginia', u'Virg')             (u'Middle Tennessee', u'MTnSt')
(u'Oklahoma State', u'OKSt')       (u'Texas A&M', u'TexAM')
(u'Notre Dame', u'NDame')          (u'UCLA', u'UCLA')
(u'Rutgers', u'Rutgr')             (u'Cincinnati', u'Cincy')
(u'Ohio State', u'OhSt')           (u'Purdue', u'Prdue')
(u'LSU', u'LSU')                   (u'Florida', u'Fla')
(u'Air Force', u'AFA')             (u'UNLV', u'UNLV')
(u'Nebraska', u'Neb')              (u'Missouri', u'Misso')
(u'New Mexico State', u'NMxSt')    (u'Boise State', u'BoiSt')
(u'Pittsburgh', u'Pitt')           (u'Navy', u'Navy')
(u'Wake Forest', u'WFrst')         (u'Florida State', u'FlaSt')
(u'San Jose State', u'SJSt')       (u'Hawaii', u'Hawa')
(u'UCF', u'UCF')                   (u'South Florida', u'SFla')

EDIT

from difflib import SequenceMatcher

li = [\
 ([u'Liberty', u'Toledo'], #######
 [u'LIBERTY', u'TOLEDO']),

([u'Chattanooga', u'Arkansas'], ################
 [u'UTC', u'AR']),

([u'Texas Tech', u'Iowa State'], ###########
 [u'TT', u'ISU'])  ]


def comp(N,D,sq = SequenceMatcher(None)):
    sq.set_seqs(N[0],D[0])
    a = sq.ratio()
    sq.set_seqs(N[1],D[1])
    b = sq.ratio()
    
    sq.set_seqs(N[0],D[1])
    x = sq.ratio()
    sq.set_seqs(N[1],D[0])
    y = sq.ratio()

    sq.set_seqs(N[0].lower(),D[0].lower())
    al = sq.ratio()
    sq.set_seqs(N[1].lower(),D[1].lower())
    bl = sq.ratio()
    
    sq.set_seqs(N[0].lower(),D[1].lower())
    xl = sq.ratio()
    sq.set_seqs(N[1].lower(),D[0].lower())
    yl = sq.ratio()

    return ((N[0],D[0]), (N[1],D[1]),
            a,b,a*b,a+b,
            (N[0].lower(),D[0].lower()), (N[1].lower(),D[1].lower()),
            al,bl,al*bl,al+bl,
            (N[0],D[1]),(N[1],D[0]),
            x,y,x*y,x+y,
            (N[0].lower(),D[1].lower()),(N[1].lower(),D[0].lower()),
            xl,yl,xl*yl,xl+yl)

print '\n'.join(('====='*14)+ '\n'
                '%-25s   %s\n'
                '    %-10f                  %f       -->   x%f  +%f\n'
                '%-25s   %s\n'
                '    %-10f                  %f       -->   x%f  +%f\n\n'
                '%-25s   %s\n'
                '    %-10f                  %f       -->   x%f  +%f\n'
                '%-25s   %s\n'
                '    %-10f                  %f       -->   x%f  +%f\n'
                % comp(N,D) for N,D in li)

result

======================================================================
(u'Liberty', u'LIBERTY')    (u'Toledo', u'TOLEDO')
    0.142857                    0.166667       -->   x0.023810  +0.309524
(u'liberty', u'liberty')    (u'toledo', u'toledo')
    1.000000                    1.000000       -->   x1.000000  +2.000000

(u'Liberty', u'TOLEDO')     (u'Toledo', u'LIBERTY')
    0.153846                    0.153846       -->   x0.023669  +0.307692
(u'liberty', u'toledo')     (u'toledo', u'liberty')
    0.307692                    0.153846       -->   x0.047337  +0.461538

======================================================================
(u'Chattanooga', u'UTC')    (u'Arkansas', u'AR')
    0.142857                    0.200000       -->   x0.028571  +0.342857
(u'chattanooga', u'utc')    (u'arkansas', u'ar')
    0.142857                    0.400000       -->   x0.057143  +0.542857

(u'Chattanooga', u'AR')     (u'Arkansas', u'UTC')
    0.000000                    0.000000       -->   x0.000000  +0.000000
(u'chattanooga', u'ar')     (u'arkansas', u'utc')
    0.153846                    0.000000       -->   x0.000000  +0.153846

======================================================================
(u'Texas Tech', u'TT')      (u'Iowa State', u'ISU')
    0.333333                    0.307692       -->   x0.102564  +0.641026
(u'texas tech', u'tt')      (u'iowa state', u'isu')
    0.333333                    0.307692       -->   x0.102564  +0.641026

(u'Texas Tech', u'ISU')     (u'Iowa State', u'TT')
    0.000000                    0.000000       -->   x0.000000  +0.000000
(u'texas tech', u'isu')     (u'iowa state', u'tt')
    0.153846                    0.333333       -->   x0.051282  +0.487179

The output above shows the following.

1/
The ratio a = 0.142857 of the pair (u'Liberty', u'LIBERTY') is lower than the ratio x = 0.153846 of the pair (u'Liberty', u'TOLEDO').
That alone makes the condition a>x and b>y evaluate to False, so the unwanted pair (u'Liberty', u'TOLEDO') is returned as part of the result,
even though the ratios for (u'Toledo', u'TOLEDO') and (u'Toledo', u'LIBERTY') correctly indicate that (u'Toledo', u'TOLEDO') is the desired pair.

Applying lower() to the strings fixes this flaw, since the pairs (u'liberty', u'liberty') and (u'toledo', u'toledo') now both have a ratio of 1.000000.
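
As a quick check of the numbers above, the sketch below recomputes the ratios for the Liberty/Toledo entry. It relies on the documented definition of SequenceMatcher.ratio(), 2.0*M/T, where M is the number of matching characters and T is the combined length of both strings. The helper name ratio is only for illustration, and the snippet is written so that it also runs on Python 3.

```python
from difflib import SequenceMatcher

def ratio(name, label):
    # Hypothetical helper: similarity of a database name to a candidate label
    return SequenceMatcher(None, name, label).ratio()

# Case-sensitive: only the capital 'L' is shared, so M = 1 and T = 7 + 7
raw_same = ratio(u'Liberty', u'LIBERTY')      # 2.0 * 1 / 14
# 'L' is again the only shared character, but TOLEDO is shorter, so T = 7 + 6
raw_other = ratio(u'Liberty', u'TOLEDO')      # 2.0 * 1 / 13
# After lower() every character matches
low_same = ratio(u'Liberty'.lower(), u'LIBERTY'.lower())

print('%f %f %f' % (raw_same, raw_other, low_same))
# prints: 0.142857 0.153846 1.000000
```

The wrong pair wins before lowering only because its combined length T is one character shorter, not because it shares more characters.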

2/
However, using lower() breaks two other cases that were previously correct.

Without lower(),
the incorrect pairs (u'Chattanooga', u'AR') and (u'Arkansas', u'UTC') had ratios of 0.000000,
so the winning pairs (u'Chattanooga', u'UTC') and (u'Arkansas', u'AR') were the correct result.

With lower(),
the correct pair (u'chattanooga', u'utc') keeps the same ratio, 0.142857, as its unlowered version,
but the incorrect pair (u'chattanooga', u'ar') now scores 0.153846,
so the correct pair (u'chattanooga', u'utc') scores lower than the incorrect pair (u'chattanooga', u'ar').
The condition therefore evaluates to False, and the incorrect pairs (u'Chattanooga', u'AR') and (u'Arkansas', u'UTC') are returned.

The same thing happens with the incorrect pairs (u'Texas Tech', u'ISU') and (u'Iowa State', u'TT'): unlowered, their ratios of 0.000000 are lower than the ratios 0.333333 and 0.307692 of the correct pairs (u'Texas Tech', u'TT') and (u'Iowa State', u'ISU').

When lowered,
the ratio of (u'iowa state', u'tt') increases from 0.000000 to 0.333333, while the correct pair (u'iowa state', u'isu') stays at the lower ratio 0.307692. So the condition again evaluates to False.

3/
It is clear that the new flaws arise because the labels u'AR' and u'TT' are very short. Once lowered, one or two of their letters match letters in the long lowered names u'chattanooga' and u'iowa state', where the unlowered strings had no match at all, and that is enough to overturn the result.
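
The effect of a short label can be seen in isolation. The sketch below (the helper name ratio is hypothetical, and the code also runs on Python 3) scores u'AR' and u'TT' against the wrong long names before and after lower(). A stray character match is enough to lift the wrong pair above zero, and with so few characters in the label that score is comparable to the score of the right pair.

```python
from difflib import SequenceMatcher

def ratio(name, label):
    # Hypothetical helper wrapping SequenceMatcher.ratio()
    return SequenceMatcher(None, name, label).ratio()

# Two pairings that should NOT be chosen
wrong_pairs = [(u'Chattanooga', u'AR'), (u'Iowa State', u'TT')]

for name, label in wrong_pairs:
    raw = ratio(name, label)                    # upper-case label shares nothing with the name
    low = ratio(name.lower(), label.lower())    # lowered 'a' or 't' now finds a match
    print('%-12s %-3s raw=%f lowered=%f' % (name, label, raw, low))
```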

It is also clear that the problems arise because my boolean expression a>x and b>y gives a lot of weight to each of the two comparisons a>x and b>y separately.
I think a condition is needed that combines the ratios behind a>x and b>y rather than testing them independently.
Multiplying the ratios together doesn't strike me as a good way to do that.
In the following code, I chose to add the ratios and to evaluate more than one condition.

from difflib import SequenceMatcher

li = [\
([u'Houston', u'Alabama'],
 [u'Houst', u'Alab']),

([u'Florida State', u'North Carolina State'],
 [u'NCSt', u'FlaSt']),

([u'Penn State', u'Iowa'],
 [u'PnSt', u'Iowa']),

([u'Oklahoma', u'Texas'],
 [u'Texas', u'Okla']),

([u'Florida Atlantic', u'South Florida'],
 [u'SFla', u'FlAtl']),

([u'Georgia', u'Tennessee'],
 [u'Geo', u'Tenn']),

([u'San Jose State', u'Idaho'],
 [u'UI', u'SJSU']),

([u'Washington State', u'Arizona State'],
 [u'ArzSt', u'WshSt']),

([u'Fresno State', u'Nevada'],
 [u'Nevad', u'FrsSt']),

([u'Oregon State', u'Arizona'],
 [u'ARIZ', u'OSU']),

([u'Clemson', u'Virginia Tech'],
 [u'VTech', u'Clem']),

([u'Chattanooga', u'Arkansas'],
 [u'UTC', u'AR']),

([u'USC', u'Stanford'],
 [u'USC', u'Stanf']),

([u'Baylor', u'Colorado'],
 [u'BU', u'CU']),

([u'North Texas', u'Louisiana-Lafayette'],
 [u'NoTex', u'LaLaf']),

([u'Tulane', u'Army'],
 [u'TLN', u'ARMY']),

([u'Troy', u'Florida International'],
 [u'TROY', u'FIU']),

([u'Louisiana-Monroe', u'Arkansas State'],
 [u'ASU', u'ULM']),

([u'Texas Tech', u'Iowa State'],
 [u'TT', u'ISU']),

([u'Akron', u'Western Michigan'],
 [u'AKRON', u'WMU']),

([u'Liberty', u'Toledo'],
 [u'LIBERTY', u'TOLEDO']),

([u'Virginia', u'Middle Tennessee'],
 [u'Virg', u'MTnSt']),

([u'Oklahoma State', u'Texas A&M'],
 [u'TexAM', u'OKSt']),

([u'Notre Dame', u'UCLA'],
 [u'NDame', u'UCLA']),

([u'Rutgers', u'Cincinnati'],
 [u'Cincy', u'Rutgr']),

([u'Ohio State', u'Purdue'],
 [u'Prdue', u'OhSt']),

([u'LSU', u'Florida'],
 [u'Fla', u'LSU']),

([u'Air Force', u'UNLV'],
 [u'AFA', u'UNLV']),

([u'Nebraska', u'Missouri'],
 [u'Misso', u'Neb']),

([u'New Mexico State', u'Boise State'],
 [u'NMxSt', u'BoiSt']),

([u'Pittsburgh', u'Navy'],
 [u'Navy', u'Pitt']),

([u'Wake Forest', u'Florida State'],
 [u'WFrst', u'FlaSt']),

([u'San Jose State', u'Hawaii'],
 [u'Hawa', u'SJSt']),

([u'UCF', u'South Florida'],
 [u'UCF', u'SFla']) ]

def comp(N,D,sq = SequenceMatcher(None)):
    sq.set_seqs(N[0],D[0])
    a = sq.ratio()
    sq.set_seqs(N[1],D[1])
    b = sq.ratio()
    
    sq.set_seqs(N[0],D[1])
    x = sq.ratio()
    sq.set_seqs(N[1],D[0])
    y = sq.ratio()

    sq.set_seqs(N[0].lower(),D[0].lower())
    al = sq.ratio()
    sq.set_seqs(N[1].lower(),D[1].lower())
    bl = sq.ratio()
    
    sq.set_seqs(N[0].lower(),D[1].lower())
    xl = sq.ratio()
    sq.set_seqs(N[1].lower(),D[0].lower())
    yl = sq.ratio()

    if ((a>0.5 and b>0.5 and a+b>1.4)
        or (al>0.5 and bl>0.5 and al+bl>1.4)):
        return (N[0],D[0]), (N[1],D[1])
    elif ((x>0.4 and y>0.4 and x+y>1.4)
          or (xl>0.4 and yl>0.4 and xl+yl>1.4)):
        return (N[0],D[1]), (N[1],D[0])
    elif x+y==0.0 and a+b>0.1:
        return (N[0],D[0]), (N[1],D[1])
    elif a+b==0.00 and x+y>0.1:
        return (N[0],D[1]), (N[1],D[0])
    elif a+b > x + y + 0.5:
        return (N[0],D[0]), (N[1],D[1])
    elif x+y > a + b + 0.5:
        return (N[0],D[1]), (N[1],D[0])
    elif a+b > x + y:
        return (N[0],D[0]), (N[1],D[1])
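    elif x+y > a + b:
        return (N[0],D[1]), (N[1],D[0])
    else:
        # Exact tie (a+b == x+y): keep the given order as an arbitrary fallback
        # rather than implicitly returning None, which would break the print below
        return (N[0],D[0]), (N[1],D[1])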

    
print '\n'.join('%-30s   %s' % comp(N,D) for N,D in li)

result

(u'Houston', u'Houst')           (u'Alabama', u'Alab')
(u'Florida State', u'FlaSt')     (u'North Carolina State', u'NCSt')
(u'Penn State', u'PnSt')         (u'Iowa', u'Iowa')
(u'Oklahoma', u'Okla')           (u'Texas', u'Texas')
(u'Florida Atlantic', u'FlAtl')   (u'South Florida', u'SFla')
(u'Georgia', u'Geo')             (u'Tennessee', u'Tenn')
(u'San Jose State', u'SJSU')     (u'Idaho', u'UI')
(u'Washington State', u'WshSt')   (u'Arizona State', u'ArzSt')
(u'Fresno State', u'FrsSt')      (u'Nevada', u'Nevad')
(u'Oregon State', u'OSU')        (u'Arizona', u'ARIZ')
(u'Clemson', u'Clem')            (u'Virginia Tech', u'VTech')
(u'Chattanooga', u'UTC')         (u'Arkansas', u'AR')
(u'USC', u'USC')                 (u'Stanford', u'Stanf')
(u'Baylor', u'BU')               (u'Colorado', u'CU')
(u'North Texas', u'NoTex')       (u'Louisiana-Lafayette', u'LaLaf')
(u'Tulane', u'TLN')              (u'Army', u'ARMY')
(u'Troy', u'TROY')               (u'Florida International', u'FIU')
(u'Louisiana-Monroe', u'ULM')    (u'Arkansas State', u'ASU')
(u'Texas Tech', u'TT')           (u'Iowa State', u'ISU')
(u'Akron', u'AKRON')             (u'Western Michigan', u'WMU')
(u'Liberty', u'LIBERTY')         (u'Toledo', u'TOLEDO')
(u'Virginia', u'Virg')           (u'Middle Tennessee', u'MTnSt')
(u'Oklahoma State', u'OKSt')     (u'Texas A&M', u'TexAM')
(u'Notre Dame', u'NDame')        (u'UCLA', u'UCLA')
(u'Rutgers', u'Rutgr')           (u'Cincinnati', u'Cincy')
(u'Ohio State', u'OhSt')         (u'Purdue', u'Prdue')
(u'LSU', u'LSU')                 (u'Florida', u'Fla')
(u'Air Force', u'AFA')           (u'UNLV', u'UNLV')
(u'Nebraska', u'Neb')            (u'Missouri', u'Misso')
(u'New Mexico State', u'NMxSt')   (u'Boise State', u'BoiSt')
(u'Pittsburgh', u'Pitt')         (u'Navy', u'Navy')
(u'Wake Forest', u'WFrst')       (u'Florida State', u'FlaSt')
(u'San Jose State', u'SJSt')     (u'Hawaii', u'Hawa')
(u'UCF', u'UCF')                 (u'South Florida', u'SFla')

All the results appear to be correct.
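
For comparison, a simpler variant is sketched below. It is not the threshold cascade above but a swapped-in technique: lower both sides, sum the ratios of each possible pairing, and keep the pairing with the largest sum. The function name best_pairing is hypothetical, the code targets Python 3, and it has only been checked against the three problem entries from the EDIT section (whose lowered sums appear in the output there), not against the full list.

```python
from difflib import SequenceMatcher
from itertools import permutations

def best_pairing(names, labels):
    """Pair each name with one label so that the summed lowered ratios are maximal."""
    def score(name, label):
        # Same argument order as comp(): database name first, label second
        return SequenceMatcher(None, name.lower(), label.lower()).ratio()

    # Try every ordering of the labels; with two teams that is just two candidates
    best = max(permutations(labels),
               key=lambda perm: sum(score(n, l) for n, l in zip(names, perm)))
    return list(zip(names, best))

for names, labels in [([u'Liberty', u'Toledo'], [u'LIBERTY', u'TOLEDO']),
                      ([u'Chattanooga', u'Arkansas'], [u'UTC', u'AR']),
                      ([u'Texas Tech', u'Iowa State'], [u'TT', u'ISU'])]:
    print(best_pairing(names, labels))
```

Because every permutation is scored, the same function would also accept groups of more than two teams, at factorial cost.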

Upvotes: 1

B.Mr.W.

Reputation: 19628

Fuzzy Wuzzy is a pretty cool tool for name matching.

Here is one example:

> from fuzzywuzzy import process
> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
> process.extract("new york jets", choices, limit=2)
  [('New York Jets', 100), ('New York Giants', 78)]
> process.extractOne("cowboys", choices)
  ("Dallas Cowboys", 90)

More detail here
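
If installing a third-party package is not an option, the standard library offers a rough analogue of process.extractOne in difflib.get_close_matches. The sketch below is an assumption about how it could be applied to one entry from the question, not part of Fuzzy Wuzzy. The helper name closest is hypothetical, and cutoff=0.0 is chosen so that a best candidate is always returned.

```python
from difflib import get_close_matches

def closest(label, names):
    # Hypothetical helper: lower-case both sides, then map the best match back
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(label.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[hits[0]] if hits else None

names = [u'Houston', u'Alabama']
print(closest(u'Houst', names))
print(closest(u'Alab', names))
```

Each label is matched independently here, so nothing prevents two labels from landing on the same name; very short labels such as u'AR' may need the pairwise comparison instead.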

Upvotes: 1

progrenhard

Reputation: 2363

This is pretty much impossible to do without grabbing everything, TBH, unless you have some indicators delimiting the parts of the statement. For instance, if

Kentucky Mississippi State
MS UK

was denoted like this

[Kentucky Mississippi State]
[MS] [UK]

It would be easy to break up and parse through.

^\[([a-zA-Z,\s]*)\](?:\n)\[([a-zA-Z,\s]*)\]

EDIT:

Just read your updated data.

^\[u\'([a-zA-Z,\s]*)\',\su\'([a-zA-Z,\s]*)'\]\n\[u\'([a-zA-Z,\s]*)\',\su\'([a-zA-Z,\s]*)\'\]$

Everything is captured in the capture groups :)
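
A minimal sketch of applying the second pattern from Python follows, assuming one entry is available as a two-line string; the variable names are hypothetical. Note that the character class [a-zA-Z,\s] does not cover & or -, so entries such as Texas A&M or Louisiana-Lafayette would need a wider class.

```python
import re

# The second pattern from above, compiled from raw strings
PAIR = re.compile(
    r"^\[u\'([a-zA-Z,\s]*)\',\su\'([a-zA-Z,\s]*)'\]\n"
    r"\[u\'([a-zA-Z,\s]*)\',\su\'([a-zA-Z,\s]*)\'\]$"
)

entry = "[u'Houston', u'Alabama']\n[u'Houst', u'Alab']"
match = PAIR.match(entry)
print(match.groups())

# '&' is outside the character class, so this entry is not matched
unmatched = PAIR.match("[u'Oklahoma State', u'Texas A&M']\n[u'TexAM', u'OKSt']")
print(unmatched)
```

The regex only splits an entry into its four strings; deciding which label belongs to which name still needs a similarity comparison.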

Upvotes: 1
