Hunter Durnford

Reputation: 27

Need help getting the frequency of each number in a pandas dataframe

I am trying to find a simple way to convert a pandas dataframe into another dataframe that holds the frequency of each value per row. An example of what I'm trying to do is below.

Current dataframe example (feature labels are just index values here):

   0   1   2   3   4   ...   n
0  2   3   1   4   2         ~
1  4   3   4   3   2         ~
2  2   3   2   3   2         ~
3  1   3   0   3   2         ~
...
m  ~   ~   ~   ~   ~         ~

Dataframe I would like to convert this to:

   0   1   2   3   4   ...   n
0  0   1   2   1   1         ~
1  0   0   1   2   2         ~
2  0   0   3   2   0         ~
3  1   1   1   2   0         ~
...
m  ~   ~   ~   ~   ~         ~

As you can see, each column label corresponds to one of the possible numbers in the dataframe, and each cell holds the number of times that number occurs in the corresponding row of the original dataframe. Is there a simple way to do this with Python? I have a large dataframe that I am trying to transform into a dataframe of frequencies for feature selection.
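To make the example reproducible, here is a minimal sketch that builds only the four rows and five columns shown in full above. The names `df` and `expected` are assumptions for illustration; the real frame is larger.

```python
import pandas as pd

# Sample input: only the rows and columns shown in full in the first table.
df = pd.DataFrame([
    [2, 3, 1, 4, 2],
    [4, 3, 4, 3, 2],
    [2, 3, 2, 3, 2],
    [1, 3, 0, 3, 2],
])

# Desired output: expected.loc[r, v] is how many times value v occurs in row r of df.
expected = pd.DataFrame([
    [0, 1, 2, 1, 1],
    [0, 0, 1, 2, 2],
    [0, 0, 3, 2, 0],
    [1, 1, 1, 2, 0],
])
```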

If any more information is needed I will update my post.

Upvotes: 1

Views: 93

Answers (2)

piRSquared

Reputation: 294278

NumPy

The advantage of this approach is speed, though it is obviously more complicated.

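import numpy as np
import pandas as pd

n, k = df.shape
# Row position for every cell; assumes df has the default 0..n-1 index
i = df.index.to_numpy().repeat(k)
# Flattened cell values, used as column positions in the result
j = np.ravel(df)
m = j.max() + 1

a = np.zeros((n, m), int)

# Unbuffered in-place add, so repeated (i, j) pairs are all counted
np.add.at(a, (i, j), 1)

pd.DataFrame(a, df.index, range(m))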

   0  1  2  3  4
0  0  1  2  1  1
1  0  0  1  2  2
2  0  0  3  2  0
3  1  1  1  2  0

This builds an array of row indices i that lines up, element by element, with the values of df stored in j. Each (i, j) pair then designates a position in the array a, and np.add.at adds one at every such position.
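The code above uses the values of df.index directly as row positions, which works when the frame has the default 0..n-1 index. Below is a minimal sketch of the same np.add.at idea wrapped in a function that uses positional row numbers instead, so it should also work with arbitrary row labels. The function name `row_value_counts` and the letter index are hypothetical, and the sketch assumes every entry is a non-negative integer.

```python
import numpy as np
import pandas as pd


def row_value_counts(df):
    """Count how often each value occurs in each row (hypothetical helper).

    Assumes every entry of df is a non-negative integer.
    """
    values = df.to_numpy()
    n, k = values.shape
    rows = np.arange(n).repeat(k)  # positional row number of every cell
    cols = values.ravel()          # each cell value picks the output column
    m = cols.max() + 1             # columns 0..max value
    counts = np.zeros((n, m), dtype=int)
    # ufunc.at is unbuffered, so repeated (row, value) pairs accumulate
    np.add.at(counts, (rows, cols), 1)
    return pd.DataFrame(counts, index=df.index, columns=range(m))


# Sample data from the question, with a non-default index for illustration.
df = pd.DataFrame(
    [[2, 3, 1, 4, 2], [4, 3, 4, 3, 2], [2, 3, 2, 3, 2], [1, 3, 0, 3, 2]],
    index=list("abcd"),
)
result = row_value_counts(df)
```

A plain `counts[rows, cols] += 1` would not do the same job, because fancy-indexed assignment applies only one increment per distinct position.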

Upvotes: 1

ansev

Reputation: 30920

Use pd.value_counts with apply:

df.apply(pd.value_counts, axis=1).fillna(0)

     0    1    2    3    4
0  0.0  1.0  2.0  1.0  1.0
1  0.0  0.0  1.0  2.0  2.0
2  0.0  0.0  3.0  2.0  0.0
3  1.0  1.0  1.0  2.0  0.0
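The counts come back as floats because the missing combinations are NaN before fillna. A minimal sketch of the same idea that casts back to integers is shown below. It calls Series.value_counts through a lambda, since, as far as I know, recent pandas releases deprecate the top-level pd.value_counts function. The explicit column sort is a precaution, not something the original answer needed.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    [2, 3, 1, 4, 2],
    [4, 3, 4, 3, 2],
    [2, 3, 2, 3, 2],
    [1, 3, 0, 3, 2],
])

counts = (
    df.apply(lambda row: row.value_counts(), axis=1)  # one value_counts per row
      .fillna(0)               # values absent from a row get a count of 0
      .astype(int)             # NaN forced floats; cast back to integers
      .sort_index(axis=1)      # keep the value columns in ascending order
)
```

Note that apply with axis=1 runs a Python-level function per row, so this is likely to be slower than the NumPy approach on a large frame.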

Alternatively, use DataFrame.melt with pd.crosstab:

df2 = df.T.melt()
pd.crosstab(df2['variable'], df2['value'])
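For completeness, here is a self-contained sketch of the melt and crosstab route on the sample data. After transposing, melt puts the original row label in the 'variable' column and the cell value in the 'value' column, so the crosstab of the two is the per-row frequency table. The final rename_axis call is an optional extra that drops the axis names crosstab adds.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    [2, 3, 1, 4, 2],
    [4, 3, 4, 3, 2],
    [2, 3, 2, 3, 2],
    [1, 3, 0, 3, 2],
])

# Long format: one line per cell, with the original row label in 'variable'.
df2 = df.T.melt()

# Rows of the result are original row labels, columns are the distinct values.
counts = pd.crosstab(df2["variable"], df2["value"])

# Optional: remove the 'variable' / 'value' axis names for a cleaner display.
counts = counts.rename_axis(index=None, columns=None)
```

One thing to keep in mind is that crosstab only creates columns for values that actually occur somewhere in the frame, so a value that never appears will have no column, unlike the NumPy version, which allocates every column from 0 to the maximum.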

Upvotes: 3
