Reputation: 1700
I have a dataset of numerical data that is already aggregated, i.e. it contains pairs of (original value, count), where count is the number of occurrences of that value in the original dataset.
How can I get descriptive statistics of the original dataset using only the aggregated one? I'm looking for a simple solution (preferably using existing libraries and functions).
Let's assume the original dataset is [1, 1, 1, 1, 1, 2, 2, 2, 4].
I can compute descriptive statistics as follows (e.g. using Pandas):
import pandas
data = [1, 1, 1, 1, 1, 2, 2, 2, 4]
df = pandas.DataFrame(data, columns=['value'])
print(df.describe())
Output:
          value
count  9.000000
mean   1.666667
std    1.000000
min    1.000000
25%    1.000000
50%    1.000000
75%    2.000000
max    4.000000
The same dataset but aggregated would look like this: [[1, 5], [2, 3], [4, 1]] (value 1 occurs 5 times, value 2 occurs 3 times, value 4 occurs once).
I would like to get the same output using the aggregated dataset.
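To make the relationship between the two forms concrete, here is a minimal sketch of how the aggregated pairs can be derived from the original list. It assumes the aggregation is a plain value count; `collections.Counter` is used only for illustration and is not necessarily how the real dataset was produced.

```python
from collections import Counter

# Original (non-aggregated) data from the example above.
data = [1, 1, 1, 1, 1, 2, 2, 2, 4]

# Count occurrences of each value and emit [value, count] pairs,
# sorted by value so the order is deterministic.
aggregated = [[value, count] for value, count in sorted(Counter(data).items())]

print(aggregated)  # [[1, 5], [2, 3], [4, 1]]
```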
Upvotes: 0
Views: 224
Reputation: 75080
Let's say your aggregated dataframe df_agg looks like:
print(df_agg)  # to load the frame below, copy it and run df_agg = pd.read_clipboard()
   value  Size
0      1     5
1      2     3
2      4     1
You can use the pd.Index.repeat function to do this:
df_agg.loc[df_agg.index.repeat(df_agg['Size']), ['value']].describe()
Or np.repeat:
pd.DataFrame(np.repeat(df_agg['value'], df_agg['Size'])).describe()
          value
count  9.000000
mean   1.666667
std    1.000000
min    1.000000
25%    1.000000
50%    1.000000
75%    2.000000
max    4.000000
where:
print(df_agg.loc[df_agg.index.repeat(df_agg['Size']), ['value']])
Outputs:
   value
0      1
0      1
0      1
0      1
0      1
1      2
1      2
1      2
2      4
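If the counts are large, expanding every row may use a lot of memory. As a hedged alternative (not part of the approach above), the mean and sample standard deviation can be computed directly from the (value, count) pairs as weighted statistics. The sketch below assumes the same df_agg with columns value and Size, and cross-checks the result against the expanded frame; quantiles are not covered here.

```python
import numpy as np
import pandas as pd

# Aggregated data from the question: (value, count) pairs.
df_agg = pd.DataFrame([[1, 5], [2, 3], [4, 1]], columns=["value", "Size"])

values = df_agg["value"].to_numpy(dtype=float)
counts = df_agg["Size"].to_numpy()

# Total number of observations in the original dataset.
n = counts.sum()

# Weighted mean, using the counts as weights.
mean = np.average(values, weights=counts)

# Sample standard deviation (ddof=1), which is what describe() reports.
std = np.sqrt((counts * (values - mean) ** 2).sum() / (n - 1))

# Cross-check against the row-expansion approach.
expanded = df_agg.loc[df_agg.index.repeat(df_agg["Size"]), "value"]
stats = expanded.describe()

print(n, mean, std)
print(stats)
```

The weighted formulas never materialise the repeated rows, so they scale with the number of distinct values rather than the total count.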
Upvotes: 2