Reputation: 23
One of the columns in my pandas dataframe contains a list. And I want to expand it and convert vertical shape like below. How to do it?
Before(code):
import pandas as pd
pd.DataFrame({
'col1':['fruit', 'veicle', 'animal'],
'col2':['apple', 'bycicle', 'cat'],
'col3':[1,4,2],
'list':[
[10, 20],
[1.2, 3.0, 2.75],
['tommy', 'tom']
]
})
Before(table):
|col1 |col2 |col3|list |
|------|-------|----|----------------|
|fruit |apple | 1|[10, 20] |
|veicle|bicycle| 4|[1.2, 3.0, 2.75]|
|animal|cat | 2|['tommy', 'tom']|
After
|col1 |col2 |col3|list |
|------|-------|----|-------|
|fruit |apple | 1|10 |
|fruit |apple | 1|20 |
|viecle|bycicle| 4|1.2 |
|viecle|bycicle| 4|3.0 |
|viecle|bycicle| 4|2.75 |
|animal|cat | 2|'tommy'|
|animal|cat | 2|'tom |
Note1: Length and type of lists is different.
Note2: I can NOT modify the code for generating datafarme.
Thank you for reading.
Upvotes: 2
Views: 1974
Reputation: 402814
Learned this cool trick from piR the other day, using np.repeat
and np.concatenate
:
idx = np.arange(len(df)).repeat(df.list.str.len(), 0)
out = df.iloc[idx, :-1].assign(list=np.concatenate(df.list.values))
print(out)
col1 col2 col3 list
0 fruit apple 1 10
0 fruit apple 1 20
1 veicle bycicle 4 1.2
1 veicle bycicle 4 3.0
1 veicle bycicle 4 2.75
2 animal cat 2 tommy
2 animal cat 2 tom
Performance
# Bharath
%timeit df.set_index(['col1','col2','col3']['list'].apply(pd.Series).stack()\
.reset_index().drop('level_3',axis=1)
100 loops, best of 3: 7.75 ms per loop
# Mine
%%timeit
idx = np.arange(len(df)).repeat(df.list.str.len(), 0)
out = df.iloc[idx, :-1].assign(list=np.concatenate(df.list.values))
1000 loops, best of 3: 1.41 ms per loop
df_test = pd.concat([df] * 10000)
# Bharath
%timeit df_test.set_index(['col1','col2','col3'])['list'].apply(pd.Series).stack()\
.reset_index().drop('level_3',axis=1)
1 loop, best of 3: 7.09 s per loop
# Mine
%%timeit
idx = np.arange(len(df_test)).repeat(df_test.list.str.len(), 0)
out = df_test.iloc[idx, :-1].assign(list=np.concatenate(df_test.list.values))
10 loops, best of 3: 123 ms per loop
As a 1 liner, Bharath's answer is short, but slow. Here's an improvement that uses the dataframe constructor instead of df.apply
for a 200x speedup on large data:
idx = df.set_index(['col1', 'col2', 'col3']).index
out = pd.DataFrame(df.list.values.tolist(), index=idx).stack()\
.reset_index().drop('level_3', 1).rename(columns={0 : 'list'})
print(out)
col1 col2 col3 list
0 fruit apple 1 10
1 fruit apple 1 20
2 veicle bycicle 4 1.2
3 veicle bycicle 4 3
4 veicle bycicle 4 2.75
5 animal cat 2 tommy
6 animal cat 2 tom
100 loops, best of 3: 4.7 ms per loop
10 loops, best of 3: 28.9 ms per loop
Upvotes: 4
Reputation: 2901
Here is roughly how you can accomplish this task. This is not the exact solution but will give you an idea of how you accomplish your task:
original_df = <your dataframe to start>
new_empty_df = pd.DataFrame()
# now go through each row of the original df
for i in range(original_df.shape[0]):
row_Series = original_df.iloc[i]
row_list = row_Series['list']
for item in row_list:
new_empty_df.append({'col1':row_Series['col1'],
'col2':row_Series['col2'],
'list':item})
Upvotes: 0
Reputation: 30605
You can set_index of first three columns and then apply pd.Series
to the column of list and then stack them.
df.set_index(['col1','col2','col3'])['list'].apply(pd.Series).stack().reset_index().drop('level_3',axis=1)
Output:
col1 col2 col3 0 0 fruit apple 1 10 1 fruit apple 1 20 2 veicle bycicle 4 1.2 3 veicle bycicle 4 3 4 veicle bycicle 4 2.75 5 animal cat 2 tommy 6 animal cat 2 tom
Upvotes: 5