Reputation: 109
So I have a list and a DataFrame. I want to take each word from the list and make it the title of a new column. If the word is in the row, it's added to the newly created column; if it's not in the row, leave it blank or NA. Should I use iloc?
import pandas as pd
wordlist = [['this is sentence 1'],['this is sentence 2'],['this is not a sentence'],['ok who is this']]
query=['is','not']
df = pd.DataFrame(wordlist, columns = ['Name'])
for word in query:
    if word in df['Name']:
        df[word] = word
df
Output
                     Name  is  not   <<column titles
0      this is sentence 1  is   NA
1      this is sentence 2  is   NA
2  this is not a sentence  is  not
3          ok who is this  is   NA
Upvotes: 2
Views: 581
Reputation: 59519
Create a search pattern, then use Series.str.extractall to get the words. Then turn each unique word into a dummy, aggregate back to the original row index, and join back to the original DataFrame.
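import pandas as pd
# One capture group that matches any of the query words
pat = f'({"|".join(query)})'
#(is|not)
# One dummy column per matched word, then collapse the per-match rows back to one
# row per original index (.max(level=0) was removed in pandas 2.0)
df_dummies = pd.get_dummies(df['Name'].str.extractall(pat)[0], dtype=int).groupby(level=0).max()
df = pd.concat([df, df_dummies], axis=1)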
#                     Name  is  not
#0      this is sentence 1   1    0
#1      this is sentence 2   1    0
#2  this is not a sentence   1    1
#3          ok who is this   1    0
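To see why the aggregation step is needed, it can help to look at what extractall returns before the dummies are built. This is only a minimal sketch on a cut-down two-row frame; the names matches and per_row are mine, not part of the answer above.

```python
import pandas as pd

# Cut-down version of the question's data (assumption: two rows are enough to illustrate)
df = pd.DataFrame({'Name': ['this is sentence 1', 'this is not a sentence']})
pat = '(is|not)'

# extractall returns one row per match, indexed by (original row label, match number)
matches = df['Name'].str.extractall(pat)

# Collect the matches of each original row into a list to inspect them
per_row = matches[0].groupby(level=0).agg(list)
print(per_row)
```

Each original row can produce several match rows (here 'is' is found twice in every sentence, once inside 'this'), which is why the dummies have to be reduced with a groupby on the first index level.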
If instead of dummies you really want the words repeated, then we can multiply the dummy DataFrame by the columns.
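import numpy as np
df_dummies = pd.get_dummies(df['Name'].str.extractall(pat)[0]).groupby(level=0).max()
# Cast to object so the multiplication is Python's string repetition:
# 1 * 'is' -> 'is' and 0 * 'is' -> '', and the empty strings then become NaN
df_dummies = df_dummies.astype(object).mul(df_dummies.columns).replace('', np.nan)
df = pd.concat([df, df_dummies], axis=1)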
#                     Name  is  not
#0      this is sentence 1  is  NaN
#1      this is sentence 2  is  NaN
#2  this is not a sentence  is  not
#3          ok who is this  is  NaN
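If the dummy route feels like more machinery than you need, a plain loop over query with Series.str.contains is another way to get word-or-NaN columns, and it is closer to what the question attempted. Note this is a different technique from the extractall approach, sketched here on the assumption that only whole-word matches should count; out and has_word are illustrative names.

```python
import re
import pandas as pd

wordlist = [['this is sentence 1'], ['this is sentence 2'],
            ['this is not a sentence'], ['ok who is this']]
query = ['is', 'not']
out = pd.DataFrame(wordlist, columns=['Name'])

for word in query:
    # True where the row contains `word` as a whole word
    has_word = out['Name'].str.contains(rf'\b{re.escape(word)}\b')
    # Keep the word where it was found, NaN elsewhere
    out[word] = pd.Series(word, index=out.index).where(has_word)

print(out)
```

No iloc is needed here, because the boolean mask already lines up with the DataFrame's index.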
Finally, as a word of caution, the word 'this' itself contains the match 'is', and so the basic pattern above matches both the separate word 'is' and the last two characters of 'this'. If you want to exclude matches that are parts of longer words, then modify the search pattern to contain word boundaries around every element in query:
pat = '(\\b' + '\\b|\\b'.join(query) + '\\b)'
#'(\\bis\\b|\\bnot\\b)'
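Putting the word-boundary pattern together with the dummy step might look like the sketch below. Two things here are my own assumptions rather than part of the answer: the last row is changed to 'ok who was this' so that one row has no whole-word match, and a reindex is added because rows with no match do not appear in the extractall result at all. re.escape is included in case a query word contains regex metacharacters.

```python
import re
import pandas as pd

# Assumption: last row altered so it contains 'this' but not the word 'is'
df = pd.DataFrame({'Name': ['this is sentence 1', 'this is sentence 2',
                            'this is not a sentence', 'ok who was this']})
query = ['is', 'not']

# Wrap every (escaped) query word in word boundaries: (\bis\b|\bnot\b)
pat = '(' + '|'.join(rf'\b{re.escape(w)}\b' for w in query) + ')'

matches = df['Name'].str.extractall(pat)[0]
dummies = pd.get_dummies(matches, dtype=int).groupby(level=0).max()

# Rows (and query words) without any match are missing; restore them as zeros
dummies = dummies.reindex(index=df.index, columns=query, fill_value=0)

result = pd.concat([df, dummies], axis=1)
print(result)
```

With the boundaries in place, the 'is' inside 'this' no longer counts, so the last row gets 0 in both columns instead of being flagged for 'is'.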
Upvotes: 4