Ibrahim

Reputation: 175

Using str.findall to retrieve exact matches from a dictionary

I have the following dictionary

dictionary = {"car":1234, "light-blue":112, "orange":34, "blue":1, "cargo yellow":35}

And the following dataframe

df_data = {"sentence": ["the summarine is a cargo yellow and orange", "the sky was an amazing light-blue, you should have seen it", "the grass is green", "why you face is purple?", "Light blue as you! HAHA", "Have you ever use the Jungle exploration?"], "extra":['a','b','c','d', 'e', 'f'] }  
df = pd.DataFrame(df_data)

Based on a previous question of mine, I was using this code:

df['new_col'] = df.sentence.str.extract(pat = f"({'|'.join(dictionary.keys())})")[0]

But I have two problems. First, when a sentence contains multiple dictionary keys, only the first one is extracted. Second, it retrieves the key car even when car is not present as a whole word (it matches inside cargo). To solve the first problem I used the following code:

df.sentence.str.findall("|".join(dictionary.keys())).apply(", ".join)

Which results in this:

0    car, orange
1     light-blue
2               
3               
4           blue
5     

But I still have the problem with car, and in this case also with blue. What I would like to have instead is this:

0  cargo yellow, orange
1            light-blue
2                   nan  
3                   nan   
4                   nan
5                   nan

Furthermore, do you have any suggestions on how I could change the code to get this result instead:

0  cargo yellow, orange
1            light-blue
2                   nan  
3                   nan   
4            light blue
5                   nan

EDIT: I have tried the following code:

for i in dictionary.keys():
    print(i,"\n",df.sentence.str.findall(rf'\b\W?{i}\W?\b'))

In this case the key 'car' is not retrieved, but looping over every key is not very efficient considering that my dictionary has 3000 key/value pairs.

Thank you!

Upvotes: 0

Views: 378

Answers (1)

Corralien

Reputation: 120439

Prepare input data:

data = {27695: 'I am legit happy because of these news.',
        143703: "Something seems suspicious.I can't recognize that...",
        ...
        48645: 'lol.',
        185265: 'Maybe there is a chance the infinity war sets are next'}

df = pd.DataFrame({"sentence": data.values()}, index=data.keys())

Find full words:

out = df.sentence.str.findall('|'.join([fr'\b{w}\b' for w in dictionary.keys()])) \
                     .apply(lambda l: ','.join(l))
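
As a self-contained illustration, here is a minimal sketch of the same word-boundary idea applied to the sample data from the question. Two details are assumptions added here rather than part of the snippet above: each key is passed through re.escape so that characters such as "-" are always treated literally, and the keys are sorted longest first so that a multi-word key is tried before a shorter key.

```python
import re
import pandas as pd

# Sample data copied from the question
dictionary = {"car": 1234, "light-blue": 112, "orange": 34, "blue": 1, "cargo yellow": 35}
df = pd.DataFrame({"sentence": [
    "the summarine is a cargo yellow and orange",
    "the sky was an amazing light-blue, you should have seen it",
    "the grass is green",
    "why you face is purple?",
    "Light blue as you! HAHA",
    "Have you ever use the Jungle exploration?",
]})

# Assumption: try longer keys first so "cargo yellow" is preferred over "car"
keys = sorted(dictionary, key=len, reverse=True)

# One alternation for all keys; \b on both sides rejects "car" inside "cargo"
pattern = "|".join(rf"\b{re.escape(k)}\b" for k in keys)

out = df["sentence"].str.findall(pattern).apply(", ".join)
print(out)
```

With this data, row 0 gives "cargo yellow, orange" and row 1 gives "light-blue". Row 4 still gives "blue", because blue is itself a key and appears as a whole word in "Light blue as you! HAHA". Since all keys are combined into a single pattern, each sentence is scanned once instead of once per key.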

Display only matched results:

>>> out[out != '']
56082      blue,orange
134801            blue
102078            blue
106617            blue
204968            blue
139796    cargo yellow
Name: sentence, dtype: object
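
For the second output requested in the question (NaN where nothing matches, and "Light blue" reported as light blue), one possible sketch follows. It rests on assumptions about the intent that the question does not state explicitly: matching is case-insensitive, a hyphen inside a key may also appear as whitespace in the sentence, and matches are lowercased before joining. The helper key_to_regex is a hypothetical name introduced only for this example.

```python
import re
import pandas as pd

# Sample data copied from the question
dictionary = {"car": 1234, "light-blue": 112, "orange": 34, "blue": 1, "cargo yellow": 35}
sentences = pd.Series([
    "the summarine is a cargo yellow and orange",
    "the sky was an amazing light-blue, you should have seen it",
    "the grass is green",
    "why you face is purple?",
    "Light blue as you! HAHA",
    "Have you ever use the Jungle exploration?",
])

def key_to_regex(key):
    """Hypothetical helper: whole-word pattern where '-' may also be whitespace."""
    parts = [re.escape(part) for part in key.split("-")]
    return r"\b" + r"[-\s]".join(parts) + r"\b"

# Longest keys first, compiled once for all sentences
keys = sorted(dictionary, key=len, reverse=True)
pattern = re.compile("|".join(key_to_regex(k) for k in keys), flags=re.IGNORECASE)

# Lowercase each match, join per sentence, then turn empty strings into NaN
joined = sentences.apply(lambda s: ", ".join(m.lower() for m in pattern.findall(s)))
result = joined.where(joined != "")
print(result)
```

Note that the matched text ("light blue") is no longer identical to the dictionary key ("light-blue"), so looking up the value afterwards would need the space mapped back to a hyphen.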

Upvotes: 1
