user14237286

Reputation: 109

How Do I Create New Pandas Column Based On Word In A List

I have a list and a DataFrame. I want to take each word from the list and make it the title of a new column. If the word is in the row, it is added to the newly created column; if it is not in the row, the cell is left blank or NA. Should I use iloc?

import pandas as pd
wordlist = [['this is sentence 1'],['this is sentence 2'],['this is not a sentence'],['ok who is this']]
query=['is','not']
df = pd.DataFrame(wordlist, columns = ['Name'])

for word in query:
    if word in df['Name']:
        df[word] = word
df


Desired output

Name                       is     not  <<column titles
0   this is sentence 1     is     NA
1   this is sentence 2     is     NA
2   this is not a sentence is     not
3   ok who is this         is     NA

Upvotes: 2

Views: 581

Answers (1)

ALollz

Reputation: 59519

Create a search pattern then use Series.str.extractall to get the words. Then turn each unique word into a dummy and aggregate back to the original row index, and join back to the original DataFrame.

import pandas as pd

pat = f'({"|".join(query)})'
#(is|not)

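# .max(level=0) was removed in pandas 2.0; group on the original row index instead.
# dtype=int keeps the dummies as 1/0 (newer pandas defaults to True/False).
df_dummies = pd.get_dummies(df['Name'].str.extractall(pat)[0], dtype=int).groupby(level=0).max()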

df = pd.concat([df, df_dummies], axis=1)

#                     Name  is  not
#0      this is sentence 1   1    0
#1      this is sentence 2   1    0
#2  this is not a sentence   1    1
#3          ok who is this   1    0
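To make the "extract, dummy, aggregate" steps easier to follow, the one-liner can be split into its intermediate results. This is only an illustrative sketch: the names matches, dummies and per_row are hypothetical, and it assumes a pandas version where get_dummies accepts dtype and groupby(level=0) is available.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['this is sentence 1', 'this is sentence 2',
                            'this is not a sentence', 'ok who is this']})
query = ['is', 'not']
pat = f'({"|".join(query)})'

# Step 1: one row per match, indexed by (original row, match number)
matches = df['Name'].str.extractall(pat)[0]

# Step 2: one indicator column per distinct matched word
dummies = pd.get_dummies(matches, dtype=int)

# Step 3: collapse the match level so each original row appears once
per_row = dummies.groupby(level=0).max()

print(matches)
print(per_row)
```

Printing matches shows that a row such as 'this is sentence 1' contributes two matches (one of them inside 'this'), which is why the aggregation step is needed before joining back to df.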

If instead of dummies you really want the words repeated then we can multiply the dummy DataFrame by the columns.

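import numpy as np

df_dummies = pd.get_dummies(df['Name'].str.extractall(pat)[0], dtype=int).groupby(level=0).max()
# Multiplying 1/0 by the column label yields the word or an empty string
df_dummies = df_dummies.mul(df_dummies.columns).replace('', np.nan)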
df = pd.concat([df, df_dummies], axis=1)

#                     Name  is  not
#0      this is sentence 1  is  NaN
#1      this is sentence 2  is  NaN
#2  this is not a sentence  is  not
#3          ok who is this  is  NaN
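For comparison, a loop closer to the one in the question can also work. The original `if word in df['Name']` tests membership against the index labels (0 to 3), not the strings, so it never matches. A minimal sketch, assuming plain substring matching is acceptable (so 'is' also matches inside 'this'), builds a boolean mask with Series.str.contains and keeps the word only where the mask is True:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['this is sentence 1', 'this is sentence 2',
                            'this is not a sentence', 'ok who is this']})
query = ['is', 'not']

for word in query:
    # True for rows whose text contains `word` as a plain substring
    mask = df['Name'].str.contains(word, regex=False)
    # Repeat the word down the column, then blank out rows where mask is False
    df[word] = pd.Series(word, index=df.index).where(mask)

print(df)
```

This is not the approach taken in the answer, just an alternative that may read more naturally for a short query list; the extractall version scans each string once regardless of how many words are in query.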

Finally, a word of caution: the word 'this' itself contains 'is', so the basic pattern above matches both the separate word 'is' and the last two characters of 'this'. If you want to exclude matches that are parts of longer words, modify the search pattern to put word boundaries around every element in query:

pat = '(\\b' + '\\b|\\b'.join(query) + '\\b)'
#'(\\bis\\b|\\bnot\\b)'
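A small check of the word-boundary behaviour is sketched below. The sample Series s and the name found are made up for illustration, and re.escape is an extra precaution (not part of the answer above) in case a query word contains regex metacharacters. The pattern places the boundaries outside the group, which is equivalent to wrapping each alternative.

```python
import re
import pandas as pd

query = ['is', 'not']
# Equivalent to '(\\bis\\b|\\bnot\\b)', with each word escaped
pat = r'\b(' + '|'.join(map(re.escape, query)) + r')\b'

# Hypothetical sample: the second row contains 'is' only inside 'this'
s = pd.Series(['this is sentence 1',
               'this one has no match',
               'this is not a sentence'])

matches = s.str.extractall(pat)[0]
# Collect the matched words per original row; rows without matches are absent
found = matches.groupby(level=0).agg(list)

print(found)
```

Rows with no whole-word match drop out of the extractall result entirely, so after concatenating back to the original DataFrame their new columns will hold NaN.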

Upvotes: 4
