Jim Eisenberg
Jim Eisenberg

Reputation: 1500

Pandas generate True/False column if term is in cell

I have a column with values listed as a string, delimited with '; ' and I would like to generate true/false columns for each separate value. The dataframe is large (100k rows, 13 values) and my current code does around 1.5 rows per second in that dataframe. Is there a dedicated pandas function that would do this much faster? Or is there an obvious mistake I'm making that slows my code down impossibly?

Data looks like:

   id      likes colour
0   1              blue
1   2  blue; green; red
2   3       red; purple
3   4      purple; blue
4   5       yellow; red
5   6            yellow

Output is:

   id      likes colour    red  yellow   blue  purple  green
0   1              blue  False   False   True   False  False
1   2  blue; green; red   True   False   True   False   True
2   3       red; purple   True   False  False    True  False
3   4      purple; blue  False   False   True    True  False
4   5       yellow; red   True    True  False   False  False
5   6            yellow  False    True  False   False  False

Current function to go from first to second frame:

def do_thing(data, col_name = 'likes colour'):
    cols = list(set([j for i in list(data[col_name]) for j in i.split('; ')]))
    for i in cols:
        data[i] = None
    for i in data.index:
        for j in cols:
            if j in data.loc[i, col_name]:
                data.loc[i, j] = True
            else:
                data.loc[i, j] = False
    return data

Basically, this first identifies each different value, then creates empty columns for each and finally fills these with True or False there or not a given word is there.

It's fine for the sample dataframe, but once I add a few more colors and multiply by 10000 it goes at a snail's pace. Help is much appreciated!

Upvotes: 0

Views: 1127

Answers (1)

Chris Adams
Chris Adams

Reputation: 18647

Try using Series.str.get_dummies and DataFrame.join:

# Setup
df = pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6}, 'likes colour': {0: 'blue', 1: 'blue; green; red', 2: 'red; purple', 3: 'purple; blue', 4: 'yellow; red', 5: 'yellow'}})

df.join(df['likes colour'].str.get_dummies(sep='; ').astype(bool))

[out]

   id      likes colour   blue  green  purple    red  yellow
0   1              blue   True  False   False  False   False
1   2  blue; green; red   True   True   False   True   False
2   3       red; purple  False  False    True   True   False
3   4      purple; blue   True  False    True  False   False
4   5       yellow; red  False  False   False   True    True
5   6            yellow  False  False   False  False    True

Upvotes: 9

Related Questions