user3078523

Reputation: 1700

Descriptive statistics on aggregated data in Python

I have a numerical dataset that is already aggregated, i.e. it contains pairs of (original value, count), where count is the number of occurrences of that value in the original dataset.

How can I get descriptive statistics of the original dataset using only the aggregated one? I'm looking for a simple solution (preferably using existing libraries and functions).

Example:

Let's assume the original dataset is [1, 1, 1, 1, 1, 2, 2, 2, 4]. I can compute descriptive statistics as follows (e.g. using Pandas):

import pandas

data = [1, 1, 1, 1, 1, 2, 2, 2, 4]
df = pandas.DataFrame(data, columns=['value'])
print(df.describe())

Output:

          value
count  9.000000
mean   1.666667
std    1.000000
min    1.000000
25%    1.000000
50%    1.000000
75%    2.000000
max    4.000000

The same dataset but aggregated would look like this: [[1, 5], [2, 3], [4, 1]] (value 1 occurs 5 times, value 2 occurs 3 times, value 4 occurs once). I would like to get the same output using the aggregated dataset.

Upvotes: 0

Views: 224

Answers (1)

anky

Reputation: 75080

Let's say your aggregated DataFrame df_agg looks like this:

print(df_agg)  # df_agg was read from the table below with df_agg = pd.read_clipboard()

   value  Size
0      1     5
1      2     3
2      4     1

You can use pd.Index.repeat to repeat each row label Size times and then select those labels, which rebuilds the original dataset:

df_agg.loc[df_agg.index.repeat(df_agg['Size']), ['value']].describe()

Or np.repeat (both snippets assume import numpy as np and import pandas as pd):

pd.DataFrame(np.repeat(df_agg['value'], df_agg['Size'])).describe()

Both give:

          value
count  9.000000
mean   1.666667
std    1.000000
min    1.000000
25%    1.000000
50%    1.000000
75%    2.000000
max    4.000000
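
For completeness, here is a self-contained sketch of the pd.Index.repeat approach that can be run as-is. It rebuilds df_agg from the numbers in the question instead of reading the clipboard; the names expanded and stats are only illustrative.

```python
import pandas as pd

# Aggregated data from the question: each row is (value, number of occurrences).
df_agg = pd.DataFrame({"value": [1, 2, 4], "Size": [5, 3, 1]})

# Repeat each row label "Size" times, then select those labels,
# which expands the aggregated frame back to one row per observation.
expanded = df_agg.loc[df_agg.index.repeat(df_agg["Size"]), ["value"]]

# describe() now sees the original nine observations.
stats = expanded.describe()
print(stats)
```

Note that this materialises one row per original observation, so memory use grows with the sum of the counts.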

To see what the expanded frame looks like, print it:

print(df_agg.loc[df_agg.index.repeat(df_agg['Size']), ['value']])

Output:

   value
0      1
0      1
0      1
0      1
0      1
1      2
1      2
1      2
2      4
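
If the counts are large, expanding the data may be impractical. As an alternative that is not part of the approach above, the same statistics can be computed directly from the (value, count) pairs. The following is only a sketch under some assumptions: counts are positive integers, the sample standard deviation (ddof=1) and linear interpolation between order statistics are used (which I believe match the describe() defaults), and weighted_quantile is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Aggregated data from the question, sorted by value so cumulative counts are meaningful.
df_agg = pd.DataFrame({"value": [1, 2, 4], "Size": [5, 3, 1]}).sort_values("value")
values = df_agg["value"].to_numpy(dtype=float)
counts = df_agg["Size"].to_numpy()

n = int(counts.sum())                       # number of original observations
mean = np.average(values, weights=counts)   # count-weighted mean
# Sample standard deviation (divide by n - 1).
std = np.sqrt((counts * (values - mean) ** 2).sum() / (n - 1))

cum = counts.cumsum()

def value_at(k):
    """Value at zero-based position k of the (virtual) sorted original data."""
    return values[np.searchsorted(cum, k, side="right")]

def weighted_quantile(q):
    """Quantile with linear interpolation between neighbouring positions."""
    pos = q * (n - 1)
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    return value_at(lo) + (value_at(hi) - value_at(lo)) * (pos - lo)

summary = pd.Series({
    "count": n, "mean": mean, "std": std, "min": values[0],
    "25%": weighted_quantile(0.25), "50%": weighted_quantile(0.5),
    "75%": weighted_quantile(0.75), "max": values[-1],
})
print(summary)
```

This keeps memory proportional to the number of distinct values rather than the number of original observations.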

Upvotes: 2
