Reputation: 1792
I have a dataframe in which one column has text information.
print(df):
... | ... | Text |
... | ... | StringA. StringB. StringC |
... | ... | StringZ. StringY. StringX |
... | ... | StringL. StringK. StringJ |
... | ... | StringA. StringZ. StringJ |
I also have a dictionary that has the following:
d = {'Dogs': ['StringA', 'StringL'], 'Cats': ['StringB', 'StringZ', 'StringJ'], 'Birds': ['StringK', 'StringY']}
EDIT: I have about 100 dictionary keys, each with 4+ values.
What I am hoping to do is create an extra column in the dataframe for each key in the dictionary, and place a "1" in that column whenever any of the key's values appears in the Text column (and a "0" otherwise).
The output I am trying to get is:
print(df):
... | ... | Text | Dogs | Cats | Birds
... | ... | StringA. StringB. StringC | 1 | 1 | 0
... | ... | StringZ. StringY. StringX | 0 | 1 | 1
... | ... | StringL. StringK. StringJ | 1 | 1 | 1
... | ... | StringA. StringZ. StringJ | 1 | 1 | 0
EDIT: The issue is that I'm not sure how to search for the values within the Text column and then write a 1 to the matching key's column when one is found. Any help would be much appreciated! Thanks!
Upvotes: 2
Views: 149
Reputation: 3455
The answer of @Abhihek is the most efficient, but just to give another solution, here is one where you loop over df first:
import numpy as np
import pandas as pd
d = {
    'Dogs': ['StringA', 'StringL'],
    'Cats': ['StringB', 'StringZ', 'StringJ'],
    'Birds': ['StringK', 'StringY']
}
df = pd.DataFrame({
    'Text': [
        'StringA. StringB. StringC',
        'StringZ. StringY. StringX',
        'StringL. StringK. StringJ',
        'StringA. StringZ. StringJ'
    ]
})
for index in df.index:
    for key, s_elements in d.items():
        # 1 if any of this key's values occurs in the row's text, else 0
        df.at[index, key] = int(any(s in df.at[index, 'Text'] for s in s_elements))
# cast the columns that have been added to 8-bit unsigned integers
df = df.astype({key: np.uint8 for key in d})
print(df.head())
Text Dogs Cats Birds
0 StringA. StringB. StringC 1 1 0
1 StringZ. StringY. StringX 0 1 1
2 StringL. StringK. StringJ 1 1 1
3 StringA. StringZ. StringJ 1 1 0
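For comparison, the same columns can also be built without an explicit row loop. This is only a minimal sketch, assuming the values should be matched as literal substrings (so each one is passed through re.escape) and that the Text column has no missing values:

```python
import re

import numpy as np
import pandas as pd

d = {
    'Dogs': ['StringA', 'StringL'],
    'Cats': ['StringB', 'StringZ', 'StringJ'],
    'Birds': ['StringK', 'StringY']
}
df = pd.DataFrame({
    'Text': [
        'StringA. StringB. StringC',
        'StringZ. StringY. StringX',
        'StringL. StringK. StringJ',
        'StringA. StringZ. StringJ'
    ]
})

for key, values in d.items():
    # One alternation pattern per key; re.escape keeps characters such as '.' literal
    pattern = '|'.join(re.escape(v) for v in values)
    # str.contains gives a boolean Series, which is then cast to 0/1
    df[key] = df['Text'].str.contains(pattern, regex=True).astype(np.uint8)

print(df)
```

With about 100 keys this still loops over the keys, but each key is handled by a single pass over the column instead of a Python-level loop over every row.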
Upvotes: 1
Reputation: 2584
import pandas as pd
d = {'Dogs': ['StringA', 'StringL'],'Cats': ['StringB', 'StringZ', 'StringJ'],'Birds': ['StringK', 'StringY']}
df = pd.DataFrame({'Text': ['StringA. StringB. StringC', 'StringZ. StringY. StringX', 'StringL. StringK. StringJ',
'StringA. StringZ. StringJ']})
for k, v in d.items():  # iterate over the dictionary's key/value pairs
    # Apply the lambda to each row: if any of the values in the list is present in the text, it's a 1
    df[k] = df.apply(lambda x: 1 if any(s in x['Text'] for s in v) else 0, axis=1)
# Output
Text Dogs Cats Birds
0 StringA. StringB. StringC 1 1 0
1 StringZ. StringY. StringX 0 1 1
2 StringL. StringK. StringJ 1 1 1
3 StringA. StringZ. StringJ 1 1 0
This solution may be suboptimal if the strings are large or if there are many of them, in which case you may have to add an additional column with some sort of trie data structure.
But the above solution should work for most moderate cases.
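If the repeated substring scans do become a bottleneck, something simpler than a trie may already be enough. The sketch below is not a trie: it assumes that the text is always delimited by '. ' and that dictionary values only need to match whole tokens, so each token can be looked up in a reverse mapping. The names value_to_keys and keys_in_text are made up for this illustration.

```python
import pandas as pd

d = {
    'Dogs': ['StringA', 'StringL'],
    'Cats': ['StringB', 'StringZ', 'StringJ'],
    'Birds': ['StringK', 'StringY']
}
df = pd.DataFrame({
    'Text': [
        'StringA. StringB. StringC',
        'StringZ. StringY. StringX',
        'StringL. StringK. StringJ',
        'StringA. StringZ. StringJ'
    ]
})

# Reverse lookup: each value maps to the set of keys it belongs to
value_to_keys = {}
for key, values in d.items():
    for v in values:
        value_to_keys.setdefault(v, set()).add(key)


def keys_in_text(text):
    """Return the set of dictionary keys that have a value among the text's tokens."""
    found = set()
    # Assumption: tokens are separated by '. ' exactly as in the sample data
    for token in text.split('. '):
        found |= value_to_keys.get(token, set())
    return found


# One pass over the text per row, independent of the number of keys
matches = df['Text'].map(keys_in_text)
for key in d:
    df[key] = matches.map(lambda found, key=key: int(key in found))

print(df)
```

Unlike the substring check, this would not flag a row where a value only appears inside a longer token, so it is only equivalent when values and tokens line up exactly.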
Upvotes: 1