Reputation: 317
**** UPDATED QUESTION ****
I have these two dataframes:
DF1
id type
0 "car"
1 "travel"
2 "sport"
3 "Cleaning-bike"
4 "Build house"
5 "test 32 sport foot"
DF2
sentence_id sentence
0 "I love cars"
1 "I don't like traveling"
2 "I don't do sport and travel"
3 "I am on vacation"
4 "My bik needs more attention"
5 "I want a house"
6 "I know a lot about football"
I would to see for each row in DF1 type column, if this word appears in any row of DF2, create a new column 'category' in DF2 containing the word appearing in the row (list of words if several match like ['sport', 'travel']. And 'no match if nothing found from DF1.
How can I do this ? Without doing for each loop with contains() method. DF1 has thousands rows and DF2 millions.
Sometimes type of DF1 doesn't match totally with sentence, but one of the category word is contained in the sentence (exemple "foot" is in the sentence with id 6) And also sometimes the sentence doesn't contains the total word of a type (exemple with 'bik').
Expected output:
sentence_id sentence category
0 'I love cars' 'car'
1 "I don't like traveling" 'travel'
2 "I don't do sport" ['sport', 'travel']
3 'I am on vacation' 'no match'
4 "My bik needs more attention" "Cleaning-bike"
5 "I want a house" "Build house"
6 "I know a lot about football" "test 32 sport foot"
Upvotes: 1
Views: 1461
Reputation: 862601
You can get all lists with Series.str.findall
:
DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'")))
print (DF2)
sentence_id sentence category
0 0 'I love cars' [car]
1 1 'I don't like traveling' [travel]
2 2 'I don't do sport and travel' [sport, travel]
3 3 'I am on vacation' []
If need also scalars if length is 1
and custom string if empty string add custom function:
f = lambda x: x[0] if len(x) == 1 else 'no match' if len(x) == 0 else x
DF2['category'] = DF2['sentence'].str.findall('|'.join(DF1['type'].str.strip("'"))).apply(f)
print (DF2)
sentence_id sentence category
0 0 'I love cars' car
1 1 'I don't like traveling' travel
2 2 'I don't do sport and travel' [sport, travel]
3 3 'I am on vacation' no match
EDIT: Create dictionary with split values by -
or space in DF1['type']
and match it in custom function:
s = DF1['type'].str.strip("'")
s = pd.Series(s.to_numpy(), index=s).str.split('-|\s+').explode().str.lower()
d = {v: k for k, v in s.items()}
print (d)
{'car': 'car',
'travel': 'travel',
'sport': 'sport',
'cleaning': 'Cleaning-bike',
'bike': 'Cleaning-bike',
'build': 'Build house',
'house': 'Build house'}
pat = '|'.join(s)
def f(x):
out = [d.get(y, y) for y in x]
if len(out) == 1:
return out[0]
elif not bool(x):
return 'no match'
else:
return out
DF2['category'] = DF2['sentence'].str.findall(pat).apply(f)
print (DF2)
sentence_id sentence category
0 0 'I love cars' car
1 1 'I don't like traveling' travel
2 2 'I don't do sport and travel' [sport, travel]
3 3 'I am on vacation' no match
4 4 'My bike needs more attention' Cleaning-bike
5 5 'I want a house' Build house
Upvotes: 2