Rodrigo

Reputation: 69

Python Pandas Dataframe: add new column based on existing column, which contains lists of lists

I am trying to add a column to the dataframe below that tells me whether a person belongs to the category Green. It should just show Y or N, depending on whether the category column contains it for that person. The problem is that the category column holds a plain string in some rows, a list of strings in others, and even a list of lists in others.


import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

How can I check, for each row, whether the category column contains the specific string 'green'?

Thank you.

Upvotes: 0

Views: 834

Answers (3)

RJ Adriaansen

Reputation: 9639

Although I agree that basic string matching serves the purpose of the question, I would like to point out that nested lists can be flattened quite easily with pd.core.common.flatten. Note that it lives in pandas' internal pd.core namespace, so it is not part of the public API and may change between versions:

import pandas as pd
import ast

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice', 'John'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown'], None]})

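def fix_list(value):
    if isinstance(value, str):
        # A string that looks like a list literal, e.g. "['green']", is parsed;
        # any other string becomes a one-element list
        try:
            value = ast.literal_eval(value) if '[' in value else [value]
        except (ValueError, SyntaxError):
            value = []
    elif not isinstance(value, list):
        # None / NaN and other non-list values become an empty list
        value = []
    return list(pd.core.common.flatten(value))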
    
df['category'] = df['category'].apply(fix_list)
df['green'] = df['category'].apply(lambda x: 'green' in x)

Result:

      user                                category  green
0      Bob                        ['green', 'red']   True
1     Jane                                ['blue']  False
2  Theresa                               ['green']   True
3    Alice  ['yellow', 'purple', 'green', 'brown']   True
4     John                                      []  False
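
Since the question asks for Y/N rather than True/False, the boolean column can be mapped afterwards. A minimal sketch, assuming the category column has already been flattened as in the code above; the column name green_flag is only illustrative:

```python
import pandas as pd

# Categories as they look after flattening (one flat list per row)
df = pd.DataFrame({
    'user': ['Bob', 'Jane', 'Theresa', 'Alice', 'John'],
    'category': [['green', 'red'], ['blue'], ['green'],
                 ['yellow', 'purple', 'green', 'brown'], []],
})

# Membership test per row, then translate the booleans to Y/N
df['green'] = df['category'].apply(lambda cats: 'green' in cats)
df['green_flag'] = df['green'].map({True: 'Y', False: 'N'})

print(df[['user', 'green_flag']])
```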

Upvotes: 1

cs95

Reputation: 402844

I would not bother flattening the list; just use basic string matching:

df['category'].astype(str).str.contains(r'\bgreen\b')

0     True
1    False
2     True
3     True
Name: category, dtype: bool

The word boundary check \b ensures we don't accidentally match words like "greenery" or "greenwich", which contain "green" as part of a larger word.
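
To see what the word boundary buys you, here is a small sketch that uses Python's re module directly on the kind of strings astype(str) produces. The 'greenery' and 'greenwich' rows are made up for illustration and are not in the original data:

```python
import re

pattern = re.compile(r'\bgreen\b')

# str() of each value mimics what astype(str) does to the column
samples = [
    str([['green'], ['red']]),     # nested list containing 'green'
    str('blue'),                   # plain string, no match
    str(['greenery']),             # 'green' only as part of a longer word
    str(['greenwich', 'green']),   # longer word plus the exact label
]

matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # [True, False, False, True]
```

Keep in mind that the match is case-sensitive, and that a label such as 'dark green' would still match, because \b only protects against longer words, not against multi-word labels.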


df.assign(has_green=df['category'].astype(str)
                                  .str.contains(r'\bgreen\b')
                                  .map({True: 'Y', False: 'N'}))

      user                          category has_green
0      Bob                  [[green], [red]]         Y
1     Jane                              blue         N
2  Theresa                           [green]         Y
3    Alice  [[yellow, purple], green, brown]         Y

Upvotes: 3

Avi Thaker

Reputation: 453

You need to use a recursive flatten.

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

def flatten(x):
    # A bare string such as 'blue' is a single label, not a sequence of
    # characters; without this check a plain 'green' would never match
    if not isinstance(x, list):
        return [x]
    rt = []
    for i in x:
        if isinstance(i, list):
            rt.extend(flatten(i))
        else:
            rt.append(i)
    return rt

def is_green(x):
    return "green" in flatten(x)

df["has_green"] = df["category"].apply(is_green)

print(df)
      user                          category  has_green
0      Bob                  [[green], [red]]       True
1     Jane                              blue      False
2  Theresa                           [green]       True
3    Alice  [[yellow, purple], green, brown]       True
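
As a variation on the same idea, the sketch below uses a generator so the search can stop at the first match, and it treats a bare string as one label and None as no label at all. The helper names iter_labels and has_label, as well as the extra 'green' and None rows, are illustrative and not part of the original data:

```python
def iter_labels(value):
    """Yield every leaf value found in an arbitrarily nested list or tuple."""
    if value is None:
        return
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_labels(item)
    else:
        # Strings (and any other scalar) are yielded whole, never split
        yield value


def has_label(value, label='green'):
    # any() stops consuming the generator as soon as a match is found
    return any(item == label for item in iter_labels(value))


categories = [
    [['green'], ['red']],
    'blue',
    ['green'],
    [['yellow', 'purple'], 'green', 'brown'],
    'green',   # bare string that should still count
    None,      # missing value
]

flags = ['Y' if has_label(c) else 'N' for c in categories]
print(flags)  # ['Y', 'N', 'Y', 'Y', 'Y', 'N']
```

With the dataframe from the question, something like df["category"].apply(has_label).map({True: 'Y', False: 'N'}) should then give the Y/N column directly.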

Upvotes: 1
