Reputation: 129
Let's suppose I have this dataset. Each variable can assume values between 1 and 5. The variables represent similar objects, so I would like to compare them across all data sets.
import pandas as pd
df = pd.DataFrame({'dataset': ["ds1", "ds2", "ds3", "ds4", "ds5"],
'var1': [1, 4, 5, 3, 1],
'var2': [2, 1, 1, 2, 5],
'var3': [2, 1, 1, 2, 5]})
df
I want to transpose the df using .T and sum up the data as follows:
rows: each representing one variable (var1 to var3) columns 1 - 5 : representing values of these variables, each cell showing the total number of appearances of this value across all datasets (var1.value_counts)
Example: Row1 represents var1. Column1 shows "1" appears two times across all datasets, column2 shows 2 appears 0 times
Expected outcome
1 2 3 4 5
var1 2 0 1 1 1
var2 2 2 0 0 1
var3 2 2 0 0 1
Maybe I need to use crosstab or pivot_table? Thanks!
Upvotes: 2
Views: 992
Reputation: 28689
Yes, crosstab
works fine (or pivot table
in @jezrael's solution), after melt
:
tab = df.melt("dataset")
pd.crosstab(tab.variable, tab.value).rename_axis(index=None, columns=None)
1 2 3 4 5
var1 2 0 1 1 1
var2 2 2 0 0 1
var3 2 2 0 0 1
Yes, you can use the normalize
argument, and choose if you want it via rows(index) or columns :
pd.crosstab(tab.variable, tab.value, normalize='all').rename_axis(index=None, columns=None)
Upvotes: 2
Reputation: 862761
Use DataFrame.melt
with DataFrame.pivot_table
and DataFrame.rename_axis
:
df1 = (df.melt('dataset')
.pivot_table(index='variable',
columns='value',
aggfunc='size',
fill_value=0)
.rename_axis(index=None, columns=None))
print (df1)
1 2 3 4 5
var1 2 0 1 1 1
var2 2 2 0 0 1
var3 2 2 0 0 1
Or for all columns without first value_counts
with transpose, replaced missing values to 0
and last converting all values to integers:
df = df.set_index('dataset').apply(pd.value_counts).T.fillna(0).astype(int)
print (df)
1 2 3 4 5
var1 2 0 1 1 1
var2 2 2 0 0 1
var3 2 2 0 0 1
Upvotes: 2