Reputation: 175
I have the following dictionary:
dictionary = {"car":1234, "light-blue":112, "orange":34, "blue":1, "cargo yellow":35}
And the following dataframe:
import pandas as pd
df_data = {"sentence": ["the summarine is a cargo yellow and orange", "the sky was an amazing light-blue, you should have seen it", "the grass is green", "why you face is purple?", "Light blue as you! HAHA", "Have you ever use the Jungle exploration?"], "extra": ['a', 'b', 'c', 'd', 'e', 'f']}
df = pd.DataFrame(df_data)
Based on a previous question of mine, I was using this code:
df['new_col'] = df.sentence.str.extract(pat = f"({'|'.join(dictionary.keys())})")[0]
But I have two problems. First, when a sentence contains several dictionary keys, only the first one is extracted. Second, it returns the word car even when car is not present as a word of its own (it is matched inside cargo). To solve the first problem I used the following code:
df.sentence.str.findall("|".join(dictionary.keys())).apply(", ".join)
Which results in this:
0 car, orange
1 light-blue
2
3
4 blue
5
But I still have the problem with car, and in this case also with blue. What I would like to get instead is this:
0 cargo yellow, orange
1 light-blue
2 nan
3 nan
4 nan
5 nan
Furthermore, do you have any suggestion on how I could change the code to get this result instead:
0 cargo yellow, orange
1 light-blue
2 nan
3 nan
4 light blue
5 nan
EDIT: I have tried the following code:
for i in dictionary.keys():
    print(i, "\n", df.sentence.str.findall(rf'\b\W?{i}\W?\b'))
In this case the key 'car' is not retrieved, but this is not very efficient, considering that my dictionary has 3000 key/value pairs.
Thank you!
Upvotes: 0
Views: 378
Reputation: 120439
Prepare input data:
data = {27695: 'I am legit happy because of these news.',
143703: "Something seems suspicious.I can't recognize that...",
...
48645: 'lol.',
185265: 'Maybe there is a chance the infinity war sets are next'}
df = pd.DataFrame({"sentence": data.values()}, index=data.keys())
Find full words:
out = df.sentence.str.findall('|'.join([fr'\b{w}\b' for w in dictionary.keys()])) \
        .apply(lambda l: ','.join(l))
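As a rough sketch of how the same full-word idea might look on the sample sentences from the question, here is a version that uses only the standard re module. Two things are my own additions and not part of the code above: every key goes through re.escape, and the keys are sorted longest first. The names sentences, extract_keys and results are purely illustrative. Note that row 4 comes out as "blue", because blue stands there as a word of its own.

```python
import re

dictionary = {"car": 1234, "light-blue": 112, "orange": 34, "blue": 1, "cargo yellow": 35}
sentences = [
    "the summarine is a cargo yellow and orange",
    "the sky was an amazing light-blue, you should have seen it",
    "the grass is green",
    "why you face is purple?",
    "Light blue as you! HAHA",
    "Have you ever use the Jungle exploration?",
]

# Assumption (not in the original answer): escape each key so characters such
# as "-" are taken literally, and put longer keys first. At a given position
# the regex engine takes the first alternative that matches, so this keeps a
# hypothetical key like "light" from shadowing "light-blue".
keys = sorted(dictionary, key=len, reverse=True)
pattern = re.compile("|".join(rf"\b{re.escape(k)}\b" for k in keys))


def extract_keys(sentence):
    """Return the matched keys joined by ', ', or None when nothing matches."""
    found = pattern.findall(sentence)
    return ", ".join(found) if found else None


results = [extract_keys(s) for s in sentences]
for row, value in enumerate(results):
    print(row, value)
```

With the dataframe from the question, the same function could be applied as df['new_col'] = df.sentence.apply(extract_keys); rows without a match then hold None, which pandas treats as missing. Since the pattern is compiled once for all keys, this avoids looping over the 3000 keys one by one.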
Display only matched results:
>>> out[out != '']
56082 blue,orange
134801 blue
102078 blue
106617 blue
204968 blue
139796 cargo yellow
Name: sentence, dtype: object
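For the second desired output in the question (row 4 giving "light blue"), one possible approach is to match case-insensitively and to treat a hyphen and a space inside a key as interchangeable. This is only a sketch under those two assumptions, which are my reading of the desired output; key_to_regex, extract_keys and results are hypothetical helper names, and the matched text is lower-cased before joining.

```python
import re

dictionary = {"car": 1234, "light-blue": 112, "orange": 34, "blue": 1, "cargo yellow": 35}
sentences = [
    "the summarine is a cargo yellow and orange",
    "the sky was an amazing light-blue, you should have seen it",
    "the grass is green",
    "why you face is purple?",
    "Light blue as you! HAHA",
    "Have you ever use the Jungle exploration?",
]


def key_to_regex(key):
    """Build a full-word pattern where '-' and whitespace in the key are interchangeable."""
    parts = re.split(r"[-\s]+", key)
    return r"\b" + r"[-\s]".join(re.escape(p) for p in parts) + r"\b"


# Longest keys first, so a shorter key that is a prefix of a longer one
# cannot be picked ahead of it.
keys = sorted(dictionary, key=len, reverse=True)
pattern = re.compile("|".join(key_to_regex(k) for k in keys), re.IGNORECASE)


def extract_keys(sentence):
    """Return lower-cased matches joined by ', ', or None when nothing matches."""
    found = [m.lower() for m in pattern.findall(sentence)]
    return ", ".join(found) if found else None


results = [extract_keys(s) for s in sentences]
for row, value in enumerate(results):
    print(row, value)
```

Here "Light blue" is picked up through the light-blue key, so the shorter key blue is not reported separately for that row. The returned text is what appears in the sentence (lower-cased), not the dictionary key itself, so a lookup of the value would still need a mapping back to the original key.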
Upvotes: 1