Descriptive statistics on aggregated data in Python

Question

I have a numerical dataset that is already aggregated, i.e. it contains pairs of (original value, count), where count is the number of occurrences of that value in the original dataset.

How can I get descriptive statistics of the original dataset using only the aggregated one? I'm looking for a simple solution (preferably using existing libraries and functions).

Example:

Let's assume the original dataset is [1, 1, 1, 1, 1, 2, 2, 2, 4]. I can compute descriptive statistics as follows (e.g. using Pandas):

import pandas

data = [1, 1, 1, 1, 1, 2, 2, 2, 4]
df = pandas.DataFrame(data, columns=['value'])
print(df.describe())

Output:

          value
count  9.000000
mean   1.666667
std    1.000000
min    1.000000
25%    1.000000
50%    1.000000
75%    2.000000
max    4.000000

The same dataset but aggregated would look like this: [[1, 5], [2, 3], [4, 1]] (value 1 occurs 5 times, value 2 occurs 3 times, value 4 occurs once). I would like to get the same output using the aggregated dataset.

anky · Accepted Answer

Let's say your aggregated dataframe df_agg looks like:

print(df_agg)  # to reproduce: copy the table below, then run df_agg = pd.read_clipboard()

   value  Size
0      1     5
1      2     3
2      4     1

You can use the pd.Index.repeat function to do this:

df_agg.loc[df_agg.index.repeat(df_agg['Size']),['value']].describe()

Or np.repeat:

pd.DataFrame(np.repeat(df_agg['value'],df_agg['Size'])).describe()

          value
count  9.000000
mean   1.666667
std    1.000000
min    1.000000
25%    1.000000
50%    1.000000
75%    2.000000
max    4.000000
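
If the counts are large, repeating every row may use a lot of memory. As a different technique from the repeat-based one above, here is a minimal sketch that computes count, mean, std, min and max directly from the (value, count) pairs and cross-checks them against the expanded result. It assumes every count is a positive integer; the names values, counts and weighted_stats are only illustrative, and the quartiles are left out because reproducing pandas' interpolation takes more care.

```python
import numpy as np
import pandas as pd

# Aggregated data from the question: each value with its number of occurrences.
df_agg = pd.DataFrame({"value": [1, 2, 4], "Size": [5, 3, 1]})

values = df_agg["value"].to_numpy(dtype=float)
counts = df_agg["Size"].to_numpy()  # assumed: positive integers

n = counts.sum()                            # number of original observations
mean = np.average(values, weights=counts)   # count-weighted mean
# Sample variance (divide by n - 1), matching the std reported by describe().
var = (counts * (values - mean) ** 2).sum() / (n - 1)

weighted_stats = pd.Series(
    {
        "count": n,
        "mean": mean,
        "std": np.sqrt(var),
        "min": values.min(),  # valid only because every count is > 0
        "max": values.max(),
    }
)

# Cross-check against the expansion approach from the answer.
expanded = df_agg.loc[df_agg.index.repeat(df_agg["Size"]), "value"].describe()
matches = np.allclose(
    weighted_stats.to_numpy(), expanded[weighted_stats.index].to_numpy()
)

print(weighted_stats)
print("matches expanded describe():", matches)
```

For data this small, the repeat approach is simpler and also gives the quartiles, so the weighted version is mainly worth considering when the expanded data would not fit comfortably in memory.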

Where:

print(df_agg.loc[df_agg.index.repeat(df_agg['Size']),['value']])

Outputs:

   value
0      1
0      1
0      1
0      1
0      1
1      2
1      2
1      2
2      4
