Reputation: 2014
I have a pandas dataframe:
apple banana carrot diet coke
1 1 1 0
0 1 0 0
1 0 0 0
1 0 1 1
0 1 1 0
0 1 1 0
I would like to convert this to the following:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'diet coke'],
['banana', 'carrot'],
['banana', 'carrot']]
How can I do it? Thanks a lot.
Upvotes: 5
Views: 682
Reputation: 31662
@DSM's solution is great, but it works only when your values are 1 or 0, because it treats every nonzero value as a match. If you need to compare against some other value, you could try this:
[df.columns[df.ix[i,:]==1].tolist() for i in range(len(df.index))]
In [156]: [df.columns[df.ix[i,:]==1].tolist() for i in range(len(df.index))]
Out[156]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
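Note that `.ix` was removed in pandas 1.0, so the expression above fails on current versions. A minimal sketch of the same idea using positional indexing with `.iloc`, assuming the frame from the question; the name `target` is introduced here only for illustration:

```python
import pandas as pd

# Frame from the question, rebuilt by hand.
df = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0]],
    columns=["apple", "banana", "carrot", "diet coke"],
)

target = 1  # hypothetical name: the value each cell is compared against

# df.iloc[i, :] is row i as a Series; comparing it with target gives a
# boolean mask, which then selects the matching column labels.
result = [
    df.columns[(df.iloc[i, :] == target).to_numpy()].tolist()
    for i in range(len(df.index))
]
print(result)
```

Changing `target` is all that is needed to match a value other than 1.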
EDIT
Alternatively, you could slightly modify @DSM's solution:
In [177]: [df.columns[row == 1].tolist() for row in df.values]
Out[177]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
Some performance tests:
In [179]: %timeit [df.columns[row == 1].tolist() for row in df.values]
The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 212 us per loop
In [180]: %timeit [df.columns[row.astype(bool)].tolist() for row in df.values]
10000 loops, best of 3: 186 us per loop
In [181]: %timeit [df.columns[df.ix[i,:]==1].tolist() for i in range(len(df.index))]
100 loops, best of 3: 2.4 ms per loop
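To rerun this comparison outside IPython, something like the following sketch with the standard `timeit` module could be used. The function names are hypothetical, and `by_iloc` substitutes `.iloc` for `.ix` (an assumption, since `.ix` no longer exists in current pandas), so its timing is not directly comparable to the figure above.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0]],
    columns=["apple", "banana", "carrot", "dietcoke"],
)


def by_comparison():
    # Compare each row (a NumPy array) with 1 to build the mask.
    return [df.columns[row == 1].tolist() for row in df.values]


def by_astype():
    # Treat every nonzero value as True.
    return [df.columns[row.astype(bool)].tolist() for row in df.values]


def by_iloc():
    # Row-by-row positional lookup; stands in for the removed .ix version.
    return [
        df.columns[(df.iloc[i, :] == 1).to_numpy()].tolist()
        for i in range(len(df.index))
    ]


runs = 200
for fn in (by_comparison, by_astype, by_iloc):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.1f} us per call")
```

Absolute numbers will vary by machine and pandas version; only the relative ordering is of interest.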
Upvotes: 2
Reputation: 4375
You could traverse the rows and build the lists as Pedro mentioned, or just use stack() and groupby() to get the lists:
df
Out[14]:
apple banana carrot diet_coke
0 1 1 1 0
1 0 1 0 0
2 1 0 0 0
3 1 0 1 1
4 0 1 1 0
5 0 1 1 0
df.stack()
Out[15]:
0 apple 1
banana 1
carrot 1
diet_coke 0
1 apple 0
banana 1
carrot 0
diet_coke 0
2 apple 1
banana 0
carrot 0
diet_coke 0
3 apple 1
banana 0
carrot 1
diet_coke 1
4 apple 0
banana 1
carrot 1
diet_coke 0
5 apple 0
banana 1
carrot 1
diet_coke 0
dtype: int64
df.stack()[df.stack().values ==1].reset_index()
Out[20]:
level_0 level_1 0
0 0 apple 1
1 0 banana 1
2 0 carrot 1
3 1 banana 1
4 2 apple 1
5 3 apple 1
6 3 carrot 1
7 3 diet_coke 1
8 4 banana 1
9 4 carrot 1
10 5 banana 1
11 5 carrot 1
newdf = df.stack()[df.stack().values == 1].reset_index()
newdf.groupby(['level_0'])['level_1'].apply(list)
Out[27]:
level_0
0 [apple, banana, carrot]
1 [banana]
2 [apple]
3 [apple, carrot, diet_coke]
4 [banana, carrot]
5 [banana, carrot]
Name: level_1, dtype: object
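The steps above can be collected into one runnable sketch. The frame is rebuilt by hand from the output shown earlier; the intermediate names `stacked` and `lists` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0]],
    columns=["apple", "banana", "carrot", "diet_coke"],
)

# Long format: one entry per (row label, column name) pair.
stacked = df.stack()

# Keep only the cells equal to 1; reset_index() names the two unnamed
# index levels level_0 (row label) and level_1 (column name).
newdf = stacked[stacked == 1].reset_index()

# Collect the column names of each original row into a list.
lists = newdf.groupby("level_0")["level_1"].apply(list)

result = lists.tolist()
print(result)
```

One caveat with this approach: a row that contains no 1 at all produces no group, so it is missing from the result instead of appearing as an empty list.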
Upvotes: 1
Reputation: 28083
In [24]: import pandas as pd
In [25]: import io
In [26]: data = """
apple banana carrot dietcoke
1 1 1 0
0 1 0 0
1 0 0 0
1 0 1 1
0 1 1 0
0 1 1 0
"""
In [27]: df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
In [28]: df
Out[28]:
apple banana carrot dietcoke
0 1 1 1 0
1 0 1 0 0
2 1 0 0 0
3 1 0 1 1
4 0 1 1 0
5 0 1 1 0
In [29]: [[df.columns[i] for i,field in enumerate(record) if field == 1] for j,*record in df.itertuples()]
Out[29]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
The same solution, without a list comprehension or extended tuple unpacking, is shown below:
In [32]: result = []
In [33]: for record in df.itertuples():
   ....:     row = []
   ....:     for i,field in enumerate(record[1:]):
   ....:         if field == 1:
   ....:             row.append(df.columns[i])
   ....:     result.append(row)
   ....:
In [34]: result
Out[34]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
Upvotes: 1
Reputation: 352959
Because life is short, I might do something straightforward like
>>> fruit = [df.columns[row.astype(bool)].tolist() for row in df.values]
>>> pprint.pprint(fruit)
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'diet coke'],
['banana', 'carrot'],
['banana', 'carrot']]
This works because we can use a boolean array (row.astype(bool)) to select only the elements of df.columns where the row has True.
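To see the masking step in isolation, here is a small sketch on a single row. The names `columns`, `row`, `mask`, and `selected` are chosen for illustration, and the row is the fourth one from the question.

```python
import numpy as np
import pandas as pd

columns = pd.Index(["apple", "banana", "carrot", "diet coke"])
row = np.array([1, 0, 1, 1])  # fourth row of the question's frame

# Nonzero entries become True, zeros become False.
mask = row.astype(bool)

# Indexing with a boolean array keeps only the positions that are True.
selected = columns[mask].tolist()
print(selected)  # ['apple', 'carrot', 'diet coke']
```

Because `astype(bool)` maps every nonzero value to True, a cell holding 2 or -1 would be selected as well.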
Upvotes: 6