Sezen
Sezen

Reputation: 457

How to change table form of python 3 pandas.DataFrame?

I have a pandas data frame counted and grouped by specific columns.

import pandas as pd
df = pd.DataFrame({'x':list('aaabbbbbccccc'),'y':list('2225555577777'), 'z':list('1312223224432')})
#
df.groupby(['x','y','z'])['z'].count()
# or
df.groupby(['x','y','z'])['z'].agg(['count'])
# or
df.groupby(['x','y','z'])['z'].count().reset_index(name='counts')

Results is;

   x  y  z  counts
0  a  2  1       2
1  a  2  3       1
2  b  5  2       4
3  b  5  3       1
4  c  7  2       2
5  c  7  3       1
6  c  7  4       2

How can I convert the result to following form?

   x  y 1 2 3 4
0  a  2 2 0 1 0
1  b  5 0 4 1 0
2  c  7 0 2 1 2

Upvotes: 4

Views: 981

Answers (3)

piRSquared
piRSquared

Reputation: 294328

PROJECT/KILL <-- (read: project overkill)


This uses Pandas factorize to get integer representations of unique values. The function pd.factorize returns the passed array in its new integer form as well as an array of what the unique values were.

If we do this for two arrays whose positions correspond to each other, we can perform a cross tabulation using Numpy's bincount.

Bin counting simply increments a "bin" each time a value is encountered that represents the "bin". np.bincount assumes the bins are array indices from 0:. So this starts to make sense if I want to bin count a single array of integers. But how do I handle a two dimensional array? I have to shift my integer values such that they start counting in a "new row". I also have to figure out what "new row" that is. The integer values from the unique row values represent the row. The number of unique column values represent the "shift"

tups = list(zip(df.x, df.y))
i, r = pd.factorize(tups)
j, c = pd.factorize(df.z)
n, m = len(r), len(c)
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(
    np.column_stack([r.tolist(), b]),
    columns=['x', 'y'] + c.tolist()
)

   x  y  1  3  2  4
0  a  2  2  1  0  0
1  b  5  0  1  4  0
2  c  7  0  1  2  2

With sorting the z's

Notice that I used Pandas factorize. I could have used Numpy's unique to do some of this. However, there are two reasons why I didn't. One, np.unique sorts values by default when returning the inverse array (that's what gives me the factorization). Sorting has a time complexity of O(n * log(n)) and can be a hit on performance for larger arrays. The second reason is that Numpy would require some additional annoying handling to perform the task on tuples.

However, in this case, I wanted to see the z columns presented in the same order OP had them. That required that I sort while factorizing. I still didn't want to use Numpy so I just used the sort flag in pd.factorize

tups = list(zip(df.x, df.y))
i, r = pd.factorize(tups)
j, c = pd.factorize(df.z, sort=True)
n, m = len(r), len(c)
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(
    np.column_stack([r.tolist(), b]),
    columns=['x', 'y'] + c.tolist()
)

   x  y  1  2  3  4
0  a  2  2  0  1  0
1  b  5  0  4  1  0
2  c  7  0  2  1  2

Upvotes: 3

cs95
cs95

Reputation: 402553

You will need to use unstack + reset_index:

(df.groupby(['x','y','z'])['z']
   .count()
   .unstack(-1, fill_value=0)
   .reset_index()
   .rename_axis(None, axis=1)
)

   x  y  1  2  3  4
0  a  2  2  0  1  0
1  b  5  0  4  1  0
2  c  7  0  2  1  2

Note, you can replace df.groupby(['x','y','z'])['z'].count() with df.groupby(['x','y','z']).size() for compactness, but beware that size also counts NaNs.

Upvotes: 4

BENY
BENY

Reputation: 323286

Something like crosstab

pd.crosstab([df.x,df.y],df.z).reset_index()
Out[81]: 
z  x  y  1  2  3  4
0  a  2  2  0  1  0
1  b  5  0  4  1  0
2  c  7  0  2  1  2

Upvotes: 4

Related Questions