Reputation: 69
I am trying to add a column to the dataframe below that tells me whether a person belongs to the category 'green'. It should just show Y or N, depending on whether that person's category column contains it. The problem is that the category column holds a plain string in some rows, a list of strings in others, and a list of lists in still others.
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'],
'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})
How can I check, for each row, whether the category column contains the string 'green'?
Thank you.
Upvotes: 0
Views: 834
Reputation: 9639
Although I agree that basic string matching serves the purpose of the question, I would like to point out that nested lists can be flattened quite easily with pd.core.common.flatten (note that this lives in pandas' internal pd.core namespace rather than in the public API):
import pandas as pd
import ast
df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice', 'John'],
'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown'], None]})
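def fix_list(text):
    try:
        if '[' in text:
            # a string that looks like a list, e.g. "['green', 'red']"
            text = ast.literal_eval(text)
        else:
            # a real list or a plain string: wrap it so flatten gets one iterable
            text = [text]
    except (TypeError, ValueError, SyntaxError):
        # None, NaN, or an unparsable string: treat as no categories
        text = []
    return list(pd.core.common.flatten(text))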
df['category'] = df['category'].apply(fix_list)
df['green'] = df['category'].apply(lambda x: 'green' in x)
Result:
| | user | category | green |
|---|---|---|---|
| 0 | Bob | ['green', 'red'] | True |
| 1 | Jane | ['blue'] | False |
| 2 | Theresa | ['green'] | True |
| 3 | Alice | ['yellow', 'purple', 'green', 'brown'] | True |
| 4 | John | [] | False |
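One detail of fix_list that is easy to miss: the ast.literal_eval branch only runs when a cell is a string containing '[', that is, when the list was stored as text. The sample frame holds real lists, so the scenario below (for instance a frame reloaded from CSV) is an assumption on my part, and the cell value is made up. A minimal stdlib-only sketch of what that branch does:

```python
import ast

# Hypothetical cell value: a nested list that was saved as text, not as a list object.
cell = "[['green'], ['red']]"

parsed = ast.literal_eval(cell)   # evaluates Python literals only, never arbitrary code
print(parsed)                     # [['green'], ['red']]
print(type(parsed).__name__)      # list

# A bare category name is not a valid literal, so literal_eval rejects it.
# This is why fix_list only calls it when '[' is present.
rejected = False
try:
    ast.literal_eval("blue")
except ValueError:
    rejected = True
print(rejected)                   # True
```

After parsing, the result is an ordinary nested list and can be flattened like any other row.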
Upvotes: 1
Reputation: 402844
I would not bother flattening the list; just use basic string matching:
df['category'].astype(str).str.contains(r'\bgreen\b')
0 True
1 False
2 True
3 True
Name: category, dtype: bool
The word boundary check \b is there so we don't accidentally match words like "greenery" or "greenwich", which have "green" as part of a larger word. To get the Y/N column the question asks for, map the boolean result:
df.assign(has_green=df['category'].astype(str)
.str.contains(r'\bgreen\b')
.map({True: 'Y', False: 'N'}))
user category has_green
0 Bob [[green], [red]] Y
1 Jane blue N
2 Theresa [green] Y
3 Alice [[yellow, purple], green, brown] Y
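To see what the word boundary buys you, here is a small sketch using only the standard re module on the string form of a few cells. The sample values other than the ones from the question are made up for illustration. Since str.contains treats the pattern as a regular expression by default, the same pattern should behave the same way there.

```python
import re

pattern = re.compile(r'\bgreen\b')

# str() of a nested list keeps every item as a quoted word, so \b still applies.
samples = {
    "nested list": str([['yellow', 'purple'], 'green', 'brown']),
    "plain string": "blue",
    "longer word": str(['greenery']),       # hypothetical value
    "hyphenated": str(['light-green']),     # hypothetical value
}

results = {label: bool(pattern.search(text)) for label, text in samples.items()}
for label, found in results.items():
    print(label, found)
# nested list True
# plain string False
# longer word False
# hyphenated True
```

Note the last case: a hyphen counts as a word boundary, so "light-green" still matches. The match is also case-sensitive, so "Green" would not be found without a flag such as case=False.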
Upvotes: 3
Reputation: 453
You need to use a recursive flatten.
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})
def flatten(x):
    # A bare string is a single category; iterating it would yield characters.
    if not isinstance(x, list):
        return [x]
    rt = []
    for i in x:
        if isinstance(i, list):
            rt.extend(flatten(i))
        else:
            rt.append(i)
    return rt

def is_green(x):
    return "green" in flatten(x)

df["has_green"] = df["category"].apply(is_green)
print(df)
user category has_green
0 Bob [[green], [red]] True
1 Jane blue False
2 Theresa [green] True
3 Alice [[yellow, purple], green, brown] True
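A variant of the same recursive idea, written as a generator, that returns Y/N as the question asked and also tolerates bare strings and missing values. The names iter_categories and yes_no are hypothetical, and the last two rows are extra cases I added beyond the question's data. This is a sketch, not the only way to do it:

```python
def iter_categories(value):
    """Yield every string found in value, however deeply it is nested."""
    if isinstance(value, str):
        yield value                      # a bare string is one category
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_categories(item)
    # anything else (None, NaN, numbers) yields nothing


def yes_no(value, target="green"):
    """Return 'Y' if target appears anywhere in value, else 'N'."""
    return "Y" if target in iter_categories(value) else "N"


rows = [[['green'], ['red']], 'blue', ['green'],
        [['yellow', 'purple'], 'green', 'brown'],
        'green',   # extra case: bare string
        None]      # extra case: missing value
flags = [yes_no(r) for r in rows]
print(flags)  # ['Y', 'N', 'Y', 'Y', 'Y', 'N']
```

With the frame from the question, this would be applied as df['has_green'] = df['category'].apply(yes_no).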
Upvotes: 1