Andy

Reputation: 33

Split lists into multiple columns in a pandas DataFrame

I have a source system that gives me data like this:

Name    |Hobbies
----------------------------------
"Han"   |"Art;Soccer;Writing"
"Leia"  |"Art;Baking;Golf;Singing"
"Luke"  |"Baking;Writing"

Each hobby list is semicolon-delimited. I want to turn this into a table-like structure with a column for each hobby and a flag to indicate whether a person selected that hobby:

Name    |Art     |Baking  |Golf    |Singing |Soccer  |Writing  
--------------------------------------------------------------
"Han"   |1       |0       |0       |0       |1       |1
"Leia"  |1       |1       |1       |1       |0       |0
"Luke"  |0       |1       |0       |0       |0       |1

Here's code to generate the sample data in a pandas dataframe:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         {'name': 'Han',   'hobbies': 'Art;Soccer;Writing'},
...         {'name': 'Leia',  'hobbies': 'Art;Baking;Golf;Singing'},
...         {'name': 'Luke',  'hobbies': 'Baking;Writing'},
...     ]
... )
>>> df
                   hobbies  name
0       Art;Soccer;Writing   Han
1  Art;Baking;Golf;Singing  Leia
2           Baking;Writing  Luke

Right now, I'm using the following code to get the data into a DataFrame that has the structure I want, but it is really slow (my actual data set has about 1.5 million rows):

>>> df2 = pd.DataFrame(columns=['name', 'value'])
>>>
>>> for index, row in df.iterrows():
...     for value in str(row['hobbies']).split(';'):
...         d = {'name':row['name'], 'value':value}
...         df2 = df2.append(d, ignore_index=True)
...
>>> df2 = df2.groupby('name')['value'].value_counts()
>>> df2 = df2.unstack(level=-1).fillna(0)
>>>
>>> df2
value  Art  Baking  Golf  Singing  Soccer  Writing
name
Han    1.0     0.0   0.0      0.0     1.0      1.0
Leia   1.0     1.0   1.0      1.0     0.0      0.0
Luke   0.0     1.0   0.0      0.0     0.0      1.0

Is there a more efficient way to do this?

Upvotes: 3

Views: 6137

Answers (3)

Georgy

Reputation: 13717

Actually, using .str.split and .melt should be slightly faster than looping with iterrows.

  1. Splitting to multiple columns:

    >>> df = pd.DataFrame([{'name': 'Han', 'hobbies': 'Art;Soccer;Writing'}, 
                           {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
                           {'name': 'Luke', 'hobbies': 'Baking;Writing'}])
    >>> hobbies = df['hobbies'].str.split(';', expand=True)
    >>> hobbies
            0        1        2        3
    0     Art   Soccer  Writing     None
    1     Art   Baking     Golf  Singing
    2  Baking  Writing     None     None
    
  2. Unpivoting hobbies by names:

    >>> df = df.drop('hobbies', axis=1)
    >>> df = df.join(hobbies)
    >>> stacked = df.melt('name', value_name='hobby').drop('variable', axis=1)
    >>> stacked
        name    hobby
     0   Han      Art
     1  Leia      Art
     2  Luke   Baking
     3   Han   Soccer
     4  Leia   Baking
     5  Luke  Writing
     6   Han  Writing
     7  Leia     Golf
     8  Luke     None
     9   Han     None
    10  Leia  Singing
    11  Luke     None
    
  3. Counting the values:

    >>> counts = stacked.groupby('name')['hobby'].value_counts()
    >>> result = counts.unstack(level=-1).fillna(0).astype(int)
    >>> result
    hobby  Art  Baking  Golf  Singing  Soccer  Writing
    name
    Han      1       0     0        0       1        1
    Leia     1       1     1        1       0        0
    Luke     0       1     0        0       0        1
    

There are alternatives to steps 2 and 3, like using get_dummies or crosstab, as discussed here: Pandas get_dummies on multiple columns, but the first one will eat your memory, and the second one is much slower.
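
For reference, here is a minimal sketch of the get_dummies route mentioned above, using Series.str.get_dummies to split on the separator and build the indicator columns in one step (it assumes the sample df from the question is in scope; as noted, this can be memory-hungry on large data):

    >>> # split on ';' and build one 0/1 column per hobby in a single step
    >>> dummies = df['hobbies'].str.get_dummies(sep=';')
    >>> # reattach the names and use them as the index
    >>> result = df[['name']].join(dummies).set_index('name')
    >>> result
          Art  Baking  Golf  Singing  Soccer  Writing
    name
    Han     1       0     0        0       1        1
    Leia    1       1     1        1       0        0
    Luke    0       1     0        0       0        1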


References:
Pandas split column into multiple columns by comma
Pandas DataFrame stack multiple column values into single column

Upvotes: 1

Georgy

Reputation: 13717

What you could do is, instead of appending rows to the DataFrame on every iteration, collect them in a list and append them all at once after the loop:

df3 = pd.DataFrame(columns=['name', 'value'])
d_list = []

# collect one dict per (name, hobby) pair instead of growing the DataFrame
# inside the loop
for index, row in df.iterrows():
    for value in str(row['hobbies']).split(';'):
        d_list.append({'name': row['name'],
                       'value': value})
# one append of the whole list is much cheaper than many single-row appends
df3 = df3.append(d_list, ignore_index=True)
df3 = df3.groupby('name')['value'].value_counts()
df3 = df3.unstack(level=-1).fillna(0)
df3

I checked how much time it would take for your example DataFrame. With the improvement I suggest, it's ~50 times faster.
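
If you want to reproduce the timing comparison yourself, a rough sketch along these lines should work (this assumes pd and the sample df from the question are in scope; build_rowwise and build_batched are just illustrative names, and DataFrame.append only exists in pandas versions before 2.0):

import timeit

def build_rowwise():
    # the question's approach: append to the DataFrame inside the loop
    df2 = pd.DataFrame(columns=['name', 'value'])
    for index, row in df.iterrows():
        for value in str(row['hobbies']).split(';'):
            df2 = df2.append({'name': row['name'], 'value': value},
                             ignore_index=True)
    return df2.groupby('name')['value'].value_counts().unstack(level=-1).fillna(0)

def build_batched():
    # the suggested approach: collect the dicts first, append once
    d_list = [{'name': row['name'], 'value': value}
              for index, row in df.iterrows()
              for value in str(row['hobbies']).split(';')]
    df3 = pd.DataFrame(columns=['name', 'value']).append(d_list, ignore_index=True)
    return df3.groupby('name')['value'].value_counts().unstack(level=-1).fillna(0)

print(timeit.timeit(build_rowwise, number=100))
print(timeit.timeit(build_batched, number=100))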

Upvotes: 2

havanagrawal

Reputation: 1039

Why not just change the DataFrame in place?

for idx, row in df.iterrows():
    for hobby in row.hobbies.split(";"):
        # creates the hobby column on first use and flags this person
        df.loc[idx, hobby] = True

# anyone who didn't pick a hobby gets False instead of NaN
df.fillna(False, inplace=True)
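
If you want the 0/1 layout from the question rather than booleans, a possible follow-up after the loop above (assuming the original hobbies column can be dropped) is:

# drop the raw hobbies column, index by name, and turn True/False into 1/0
result = df.drop(columns='hobbies').set_index('name').astype(int)
# optional: order the hobby columns alphabetically, as in the desired output
result = result.sort_index(axis=1)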

Upvotes: 3
