glpsx

Reputation: 679

Comparing abbreviated words in pandas

I'm experimenting/learning Python with a data set containing information on companies.

The DataFrame structure is the following (these are made up records):

import pandas as pd

df = pd.DataFrame({'key': [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'left_name': ['ET CETERA SYSTEMS', 'ODDS AND ENDS', 'MAXIMA COMPANY', 'MUSIC MANY',
                                 'GRAPHIC MASTER', 'ARC SECURITY', 'MINDNSOLES', 'REX ENERGY', 'THESIS COMPANY'],
                   'right_name': ['ET CETERA SYS', 'ODDSNENDS', 'MAX COMP', 'MUSICMANY', 'GRAPHIC MSTR',
                                  'ARC SECU', 'MIND AND SOLES', 'REXX', 'THESIS COMP']})

print(df)

   key          left_name      right_name
0  111  ET CETERA SYSTEMS   ET CETERA SYS
1  222      ODDS AND ENDS       ODDSNENDS
2  333     MAXIMA COMPANY        MAX COMP
3  444         MUSIC MANY       MUSICMANY
4  555     GRAPHIC MASTER    GRAPHIC MSTR
5  666       ARC SECURITY        ARC SECU
6  777         MINDNSOLES  MIND AND SOLES
7  888         REX ENERGY            REXX
8  999     THESIS COMPANY     THESIS COMP

My goal is to compare the acronyms of each (left_name, right_name) pair. Specifically, if the abbreviated string formed by the concatenation of the initial letters of left_name is equal to the abbreviated string formed by the concatenation of the initial letters of right_name, then return a flag of 1. Else, return 0.

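For instance, comparing the first two pairs: ET CETERA SYSTEMS and ET CETERA SYS both abbreviate to ECS, so the flag is 1; ODDS AND ENDS abbreviates to OAE while ODDSNENDS abbreviates to O, so the flag is 0.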

Visually, the resulting DataFrame I'm looking for should look like this:

   key          left_name      right_name  name_flag
0  111  ET CETERA SYSTEMS   ET CETERA SYS          1
1  222      ODDS AND ENDS       ODDSNENDS          0
2  333     MAXIMA COMPANY        MAX COMP          1
3  444         MUSIC MANY       MUSICMANY          0
4  555     GRAPHIC MASTER    GRAPHIC MSTR          1
5  666       ARC SECURITY        ARC SECU          1
6  777         MINDNSOLES  MIND AND SOLES          0
7  888         REX ENERGY            REXX          0
8  999     THESIS COMPANY     THESIS COMP          1

I think my question is closely related to this one: Upper case first letter of each word in a phrase

Unfortunately, I wasn't able to adapt the code appropriately for my problem. Any additional help would be greatly appreciated.

Upvotes: 3

Views: 484

Answers (4)

Reiner Czerwinski

Reputation: 296

You can get this with

df['name_flag'] = df.apply(lambda x:x.left_name.startswith(x.right_name),axis=1).map({True:1,False:0})
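A quick check on the question's sample data suggests this prefix test is not quite the same as comparing acronyms. The sketch below rebuilds the two name columns and compares the result with the expected name_flag values from the question; the names `expected`, `prefix_flag` and `mismatch` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'left_name': ['ET CETERA SYSTEMS', 'ODDS AND ENDS', 'MAXIMA COMPANY', 'MUSIC MANY',
                  'GRAPHIC MASTER', 'ARC SECURITY', 'MINDNSOLES', 'REX ENERGY', 'THESIS COMPANY'],
    'right_name': ['ET CETERA SYS', 'ODDSNENDS', 'MAX COMP', 'MUSICMANY', 'GRAPHIC MSTR',
                   'ARC SECU', 'MIND AND SOLES', 'REXX', 'THESIS COMP'],
})

# Expected flags, copied from the desired output in the question
expected = pd.Series([1, 0, 1, 0, 1, 1, 0, 0, 1], index=df.index)

# Flag is 1 only when the whole right_name is a prefix of left_name
prefix_flag = df.apply(lambda x: x.left_name.startswith(x.right_name), axis=1).astype(int)

# Rows where the prefix test disagrees with the expected acronym flag
mismatch = df[prefix_flag != expected]
print(mismatch)
```

On this data the two disagree for MAXIMA COMPANY / MAX COMP and GRAPHIC MASTER / GRAPHIC MSTR: each word is shortened separately there, so the full right_name is no longer a prefix of left_name.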

Upvotes: 1

Derek Eden

Reputation: 4618

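def abbr(x):
    # Join the first letter of every whitespace-separated word;
    # split() with no argument ignores repeated spaces
    return ''.join(word[0] for word in x.split())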

df['name_flag'] = (df['left_name'].apply(abbr) == df['right_name'].apply(abbr)).astype(int)

output:

0    1
1    0
2    1
3    0
4    1
5    1
6    0
7    0
8    1


With import re, either of the following also works as the body of the function (s being the name string):

''.join(re.findall(r'^[A-Z]|\s[A-Z]', s)).replace(' ', '')

or

''.join(re.findall(r'\b\w', s))
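These variants are not fully interchangeable once the names are less tidy than in the sample. A small sketch of where they can differ, using made-up inputs and hypothetical helper names (`abbr_split`, `abbr_upper`, `abbr_boundary`):

```python
import re

def abbr_split(s):
    # First character of every whitespace-separated word
    return ''.join(word[0] for word in s.split())

def abbr_upper(s):
    # Only upper-case ASCII letters at the start of the string or after whitespace
    return ''.join(re.findall(r'^[A-Z]|\s[A-Z]', s)).replace(' ', '')

def abbr_boundary(s):
    # First word character after every word boundary
    return ''.join(re.findall(r'\b\w', s))

# One name from the question plus two made-up edge cases
for name in ['ET CETERA SYSTEMS', 'Rex energy', 'A-B SYSTEMS']:
    print(name, abbr_split(name), abbr_upper(name), abbr_boundary(name))
```

For the all-caps, single-space names in the question all three agree. They diverge on lower-case words (the `[A-Z]` pattern skips them) and on hyphenated words (`\b\w` treats the part after the hyphen as a new word).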

Upvotes: 3

Andy L.

Reputation: 25239

Try this:

l = df.left_name.str.findall(r'\b\w')
r = df.right_name.str.findall(r'\b\w')
df['name_flag'] = (l == r).astype(int)

Out[366]:
   key          left_name      right_name  name_flag
0  111  ET CETERA SYSTEMS   ET CETERA SYS          1
1  222      ODDS AND ENDS       ODDSNENDS          0
2  333     MAXIMA COMPANY        MAX COMP          1
3  444         MUSIC MANY       MUSICMANY          0
4  555     GRAPHIC MASTER    GRAPHIC MSTR          1
5  666       ARC SECURITY        ARC SECU          1
6  777         MINDNSOLES  MIND AND SOLES          0
7  888         REX ENERGY            REXX          0
8  999     THESIS COMPANY     THESIS COMP          1
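Note that `str.findall` returns a list of matches per row, so `l` and `r` hold lists of initials rather than strings. If the acronym itself is worth keeping (for inspection, say), one option is to join each list with `.str.join('')`. The column names `left_abbr` and `right_abbr` below are made up for this sketch, which uses only the first three rows of the sample.

```python
import pandas as pd

df = pd.DataFrame({
    'left_name': ['ET CETERA SYSTEMS', 'ODDS AND ENDS', 'MAXIMA COMPANY'],
    'right_name': ['ET CETERA SYS', 'ODDSNENDS', 'MAX COMP'],
})

# findall gives a list of initials per row; str.join glues each list into one string
df['left_abbr'] = df['left_name'].str.findall(r'\b\w').str.join('')
df['right_abbr'] = df['right_name'].str.findall(r'\b\w').str.join('')

# Compare the acronym strings and store the result as 0/1
df['name_flag'] = (df['left_abbr'] == df['right_abbr']).astype(int)
print(df)
```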

Upvotes: 3

Shiva

Reputation: 2838

This will do the job

def get_acronym(phrase):
    words = phrase.split(' ')
    return ''.join(w[0] for w in words)

df['name_flag'] = df.right_name.map(get_acronym) == df.left_name.map(get_acronym)
df['name_flag'] = df['name_flag'].astype(int)

df output

   key          left_name      right_name  name_flag
0  111  ET CETERA SYSTEMS   ET CETERA SYS          1
1  222      ODDS AND ENDS       ODDSNENDS          0
2  333     MAXIMA COMPANY        MAX COMP          1
3  444         MUSIC MANY       MUSICMANY          0
4  555     GRAPHIC MASTER    GRAPHIC MSTR          1
5  666       ARC SECURITY        ARC SECU          1
6  777         MINDNSOLES  MIND AND SOLES          0
7  888         REX ENERGY            REXX          0
8  999     THESIS COMPANY     THESIS COMP          1
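One thing to watch with real data: `get_acronym` assumes every value is a string, so a missing name would raise an AttributeError on `.split`. A hedged sketch of a more defensive variant follows. The helper name `safe_acronym`, the made-up rows, and the choice to flag a missing name as 0 are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

def safe_acronym(phrase):
    # Treat anything that is not a string (e.g. None or NaN) as having no acronym
    if not isinstance(phrase, str):
        return ''
    # split() without arguments also ignores repeated or trailing spaces
    return ''.join(w[0] for w in phrase.split())

# Made-up rows: a doubled space in row 1 and a missing left_name in row 2
df = pd.DataFrame({
    'left_name': ['REX ENERGY', 'THESIS  COMPANY', None],
    'right_name': ['REXX', 'THESIS COMP', 'ARC SECU'],
})

left = df['left_name'].map(safe_acronym)
right = df['right_name'].map(safe_acronym)

# Flag only rows where the left acronym exists and matches the right one
df['name_flag'] = ((left == right) & (left != '')).astype(int)
print(df)
```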

Upvotes: 3
