user11871120

Reputation: 55

Convert list of rows to frequency table in Pandas

I have a pandas dataframe:

   |     items
--------------
0  |    [a]
1  |    [a, b]
2  |    [d, e, f, f]
3  |    [d, f, e]
4  |    [c, a, b]

I would like to count the frequency of each item in each row's list and construct a table like the following:

    a|  b|  c|  d|  e|  f
-------------------------
0|  1|  0|  0|  0|  0|  0
1|  1|  1|  0|  0|  0|  0
2|  0|  0|  0|  1|  1|  2
3|  0|  0|  0|  1|  1|  1
4|  1|  1|  1|  0|  0|  0

I looked at pandas.explode, but I don't think that is what I want.

I can do something like the code below, but I suspect there is a more efficient way to do this. I have about 3.5 million rows.


import pandas as pd
from collections import Counter,defaultdict

df = pd.DataFrame({'items':[['a'],['a','b'],
                            ['d','e','f','f'],['d','f','e'],
                            ['c','a','b']]})


alist = sum(sum(df.values.tolist(),[]),[]) # flatten the list
unique_list = sorted(set(alist)) # get unique value for column names
unique_list

b = defaultdict(list)
for row in sum(df.values.tolist(),[]):
    counts = Counter(row)
    for name in unique_list:
        if name in counts.keys():
            b[name].append(counts[name])
        else:
            b[name].append(0)

pd.DataFrame(b)

Upvotes: 5

Views: 519

Answers (2)

user3483203

Reputation: 51165

Since your sublists contain duplicates, this is more of a pivot problem than a get_dummies one, but you need to expand the sublists first.

You can use Series.explode followed by crosstab here.


ii = df['items'].explode()

pd.crosstab(ii.index, ii)

items  a  b  c  d  e  f
row_0
0      1  0  0  0  0  0
1      1  1  0  0  0  0
2      0  0  0  1  1  2
3      0  0  0  1  1  1
4      1  1  1  0  0  0
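
For anyone who wants to run this end to end, here is a self-contained sketch of the same explode + crosstab approach on the question's sample data. The reindex step and the clearing of the axis names are my additions, not part of the snippet above: I am assuming you want a row of zeros for any row whose list is empty (explode turns an empty list into NaN, which crosstab would otherwise leave out) and a header without the "items"/"row_0" labels.

```python
import pandas as pd

df = pd.DataFrame({'items': [['a'], ['a', 'b'],
                             ['d', 'e', 'f', 'f'], ['d', 'f', 'e'],
                             ['c', 'a', 'b']]})

# One row per list element; the original row label is repeated in the index.
ii = df['items'].explode()

# Count how often each (row label, item) pair occurs.
freq = pd.crosstab(ii.index, ii)

# Assumption: rows with empty lists should show up as all zeros.
# On this sample data the reindex changes nothing.
freq = freq.reindex(df.index, fill_value=0)

# Cosmetic: drop the axis names that crosstab adds.
freq.index.name = None
freq.columns.name = None

print(freq)
```

The counts are the same as in the table above; only the header labels differ.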

Performance

df = pd.concat([df]*10_000, ignore_index=True)

In [91]: %timeit chris(df)
1.07 s ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [92]: %timeit user11871120(df)
15.8 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [93]: %timeit ricky_kim(df)
56.4 s ± 1.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
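
The three functions timed above are not shown, so here is a rough sketch of what they could look like if you want to repeat the comparison. The function bodies are my reconstruction from the snippets posted in this thread, not the exact code that was timed. The replication factor of 100 is an arbitrary smaller value chosen so the script finishes quickly, and ricky_kim goes through pd.Series(row).value_counts() rather than the top-level pd.value_counts.

```python
import timeit
from collections import Counter, defaultdict

import pandas as pd


def chris(df):
    # explode + crosstab, as in this answer
    ii = df['items'].explode()
    return pd.crosstab(ii.index, ii)


def user11871120(df):
    # Counter loop from the question (reconstructed)
    rows = sum(df.values.tolist(), [])      # list of the per-row lists
    names = sorted(set(sum(rows, [])))      # unique items, used as column names
    b = defaultdict(list)
    for row in rows:
        counts = Counter(row)
        for name in names:
            b[name].append(counts[name])    # Counter returns 0 for missing keys
    return pd.DataFrame(b)


def ricky_kim(df):
    # apply + value_counts, as in the other answer (reconstructed)
    return (df['items']
            .apply(lambda row: pd.Series(row).value_counts())
            .fillna(0)
            .astype(int))


base = pd.DataFrame({'items': [['a'], ['a', 'b'],
                               ['d', 'e', 'f', 'f'], ['d', 'f', 'e'],
                               ['c', 'a', 'b']]})

# Assumption: 100 copies instead of 10_000, to keep the run short.
small = pd.concat([base] * 100, ignore_index=True)

for func in (chris, user11871120, ricky_kim):
    seconds = timeit.timeit(lambda: func(small), number=1)
    print(f"{func.__name__}: {seconds:.3f} s")
```

Absolute numbers will depend on your machine and pandas version, so treat the output as a relative comparison only.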

Upvotes: 4

Ricky Kim

Reputation: 2022

Another method using apply and value_counts:

df['items'].apply(lambda row: pd.Series(row).value_counts()).fillna(0).astype(int)

OUTPUT:

   a  b  f  d  e  c
0  1  0  0  0  0  0
1  1  1  0  0  0  0
2  0  0  2  1  1  0
3  0  0  1  1  1  0
4  1  1  0  0  0  1
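
Note that the columns above are not in the alphabetical order asked for in the question. A minimal sketch that sorts them, assuming alphabetical column order is what you want; it calls value_counts through pd.Series(row), which avoids the top-level pd.value_counts function that recent pandas versions (2.1 and later) deprecate:

```python
import pandas as pd

df = pd.DataFrame({'items': [['a'], ['a', 'b'],
                             ['d', 'e', 'f', 'f'], ['d', 'f', 'e'],
                             ['c', 'a', 'b']]})

counts = (df['items']
          .apply(lambda row: pd.Series(row).value_counts())  # one count Series per row
          .fillna(0)             # items absent from a row come back as NaN
          .astype(int)
          .sort_index(axis=1))   # sort the column labels alphabetically

print(counts)
```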

Upvotes: 3
