codelifevcd
codelifevcd

Reputation: 317

How to compare two column words values from two dataframes, and create a new column containing matching/contained words?

**** UPDATED QUESTION ****

I have these two dataframes:

DF1

id type 
0  "car"
1  "travel"
2  "sport"
3  "Cleaning-bike"
4  "Build house"
5  "test 32 sport foot"

DF2

sentence_id sentence
0           "I love cars"
1           "I don't like traveling"
2           "I don't do sport and travel"
3           "I am on vacation"
4           "My bik needs more attention"
5           "I want a house"
6           "I know a lot about football"

I would to see for each row in DF1 type column, if this word appears in any row of DF2, create a new column 'category' in DF2 containing the word appearing in the row (list of words if several match like ['sport', 'travel']. And 'no match if nothing found from DF1.

How can I do this ? Without doing for each loop with contains() method. DF1 has thousands rows and DF2 millions.

Sometimes type of DF1 doesn't match totally with sentence, but one of the category word is contained in the sentence (exemple "foot" is in the sentence with id 6) And also sometimes the sentence doesn't contains the total word of a type (exemple with 'bik').

Expected output:

sentence_id     sentence                        category
    0           'I love cars'                   'car'
    1           "I don't like traveling"        'travel'
    2           "I don't do sport"              ['sport', 'travel']
    3           'I am on vacation'              'no match'
    4           "My bik needs more attention"  "Cleaning-bike"
    5           "I want a house"                "Build house"
    6           "I know a lot about football"   "test 32 sport foot"

Upvotes: 1

Views: 1461

Answers (1)

jezrael
jezrael

Reputation: 862601

You can get all lists with Series.str.findall:

DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'")))
print (DF2)
   sentence_id                       sentence         category
0            0                  'I love cars'            [car]
1            1       'I don't like traveling'         [travel]
2            2  'I don't do sport and travel'  [sport, travel]
3            3             'I am on vacation'               []

If need also scalars if length is 1 and custom string if empty string add custom function:

f = lambda x: x[0] if len(x) == 1 else 'no match' if len(x) == 0 else x
DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'"))).apply(f)


print (DF2)
   sentence_id                       sentence         category
0            0                  'I love cars'              car
1            1       'I don't like traveling'           travel
2            2  'I don't do sport and travel'  [sport, travel]
3            3             'I am on vacation'         no match

EDIT: Create dictionary with split values by - or space in DF1['type'] and match it in custom function:

s = DF1['type'].str.strip("'")

s = pd.Series(s.to_numpy(), index=s).str.split('-|\s+').explode().str.lower()
d = {v: k for k, v in s.items()}
print (d)
{'car': 'car', 
 'travel': 'travel', 
 'sport': 'sport',
 'cleaning': 'Cleaning-bike',
 'bike': 'Cleaning-bike',
 'build': 'Build house', 
 'house': 'Build house'}

pat = '|'.join(s)

def f(x):
    out = [d.get(y, y) for y in x]
    if len(out) == 1:
        return out[0]
    elif not bool(x):
        return 'no match'
    else:
        return out

DF2['category'] = DF2['sentence'].str.findall(pat).apply(f)
print (DF2)
   sentence_id                        sentence         category
0            0                   'I love cars'              car
1            1        'I don't like traveling'           travel
2            2   'I don't do sport and travel'  [sport, travel]
3            3              'I am on vacation'         no match
4            4  'My bike needs more attention'    Cleaning-bike
5            5                'I want a house'      Build house

Upvotes: 2

Related Questions