skad00sh
skad00sh

Reputation: 171

Calculate number of time series in data frame for required hierarchy

I wanted to calculate the number of series present in the given data.

I need this information for the time-series count.

Here is the data

Here I would like the user to select how to check series.

e.g. Series can be Region > Product > Country (please take this selection for this code also)

Now, series are:

  1. Asia > A > India
  2. Asia > A > Thailand
  3. Asia > B > India
  4. Asia > B > Thailand
  5. Asia > D > Japan
  6. Europe > A > Italy
  7. Europe > A > Turkey
  8. Europe > B > Italy
  9. Asia > A
  10. Asia > B
  11. Asia > D
  12. Europe > A
  13. Europe > B
  14. Asia
  15. Europe
  16. World (not included in data frame, so need to do '+1' in code)

As you can see, total of 15 time-series are present (16 including world).

So I need an answer as '16' since there are 16 time-series for selected hierarchy.

I was successfully able to do this by converting CSV to excel and then counting all series. But it is very time consuming if I have large data.

deleted

Note: Hierarchy used in the above code is Region > Country > Product

Is it possible to do this without creating new excel file?

Also, GroupBy() is not sufficient in this case.

You can check my old question which uses groupby(), but this question is different than old one.

Here is the numpy array for you:

array([['Asia', 'India', 'A', 200],
       ['Asia', 'Thailand', 'A', 150],
       ['Asia', 'India', 'B', 175],
       ['Asia', 'Thailand', 'B', 225],
       ['Asia', 'Japan', 'D', 325],
       ['Europe', 'Italy', 'A', 120],
       ['Europe', 'Turkey', 'A', 130],
       ['Europe', 'Italy', 'B', 160]], dtype=object)

Upvotes: 0

Views: 46

Answers (1)

Ke Zhang
Ke Zhang

Reputation: 987

If I understand you question correctly, total number of series will be the sum of series with different length. Check code below.

import numpy as np
import pandas as pd
# Construct df from np array
arr = np.array([['Asia', 'India', 'A', 200],
       ['Asia', 'Thailand', 'A', 150],
       ['Asia', 'India', 'B', 175],
       ['Asia', 'Thailand', 'B', 225],
       ['Asia', 'Japan', 'D', 325],
       ['Europe', 'Italy', 'A', 120],
       ['Europe', 'Turkey', 'A', 130],
       ['Europe', 'Italy', 'B', 160]], dtype=object)
title = ['Region', 'Country', 'Product', 'Sales'] 
df = pd.DataFrame(columns = title, data = arr) 
# Remove duplicates
df = df.drop_duplicates()

# Count 3 nodes Series plus 'world' like you mentioned in question
count = len(df) + 1 
for i in list(df.groupby('Region')):
    # component before + 1 is counting 2 nodes Series 
    # +1 at the end is counting 1 node Series
    count = count + len(list(i[1].groupby('Product'))) + 1 

Upvotes: 1

Related Questions