Calculate number of time series in data frame for required hierarchy

Question

I want to calculate the number of series present in the given data.

I need this number as the time-series count.

The user should be able to select the hierarchy that defines the series.

E.g. the hierarchy can be Region > Product > Country (please use this selection for the code as well).

Now, series are:

Asia > A > India
Asia > A > Thailand
Asia > B > India
Asia > B > Thailand
Asia > D > Japan
Europe > A > Italy
Europe > A > Turkey
Europe > B > Italy
Asia > A
Asia > B
Asia > D
Europe > A
Europe > B
Asia
Europe
World (not included in data frame, so need to do '+1' in code)

As you can see, a total of 15 time series are present (16 including World).

So I need '16' as the answer, since there are 16 time series for the selected hierarchy.

I was able to do this by converting the CSV to Excel and then counting all the series, but that is very time-consuming for large data.

deleted

Note: Hierarchy used in the above code is Region > Country > Product

Is it possible to do this without creating a new Excel file?

Also, groupby() alone is not sufficient in this case.

You can check my old question, which uses groupby(), but this question is different from the old one.

Here is the numpy array for you:

array([['Asia', 'India', 'A', 200],
       ['Asia', 'Thailand', 'A', 150],
       ['Asia', 'India', 'B', 175],
       ['Asia', 'Thailand', 'B', 225],
       ['Asia', 'Japan', 'D', 325],
       ['Europe', 'Italy', 'A', 120],
       ['Europe', 'Turkey', 'A', 130],
       ['Europe', 'Italy', 'B', 160]], dtype=object)

Ke Zhang · Accepted Answer

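If I understand your question correctly, the total number of series is the sum of the counts at each level of the hierarchy: the unique Region > Product > Country combinations, the unique Region > Product pairs, the unique Regions, plus one for 'World'. Check the code below.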

import numpy as np
import pandas as pd
# Construct df from np array
arr = np.array([['Asia', 'India', 'A', 200],
       ['Asia', 'Thailand', 'A', 150],
       ['Asia', 'India', 'B', 175],
       ['Asia', 'Thailand', 'B', 225],
       ['Asia', 'Japan', 'D', 325],
       ['Europe', 'Italy', 'A', 120],
       ['Europe', 'Turkey', 'A', 130],
       ['Europe', 'Italy', 'B', 160]], dtype=object)
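title = ['Region', 'Country', 'Product', 'Sales']
df = pd.DataFrame(columns=title, data=arr)
# Keep one row per Region > Product > Country combination, so a series
# with several observations (rows) is still counted only once
df = df.drop_duplicates(subset=['Region', 'Product', 'Country'])

# Count the 3-node series, plus 1 for 'World' as mentioned in the question
count = len(df) + 1
for _, group in df.groupby('Region'):
    # nunique() counts the 2-node series (Region > Product) in this region;
    # the trailing +1 counts the 1-node series (the region itself)
    count = count + group['Product'].nunique() + 1
print(count)  # 16 for the sample data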
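
Since the question asks for a user-selectable hierarchy, the same count can be written for any column order. The following is a minimal sketch and not part of the original answer: count_series and its include_root parameter are hypothetical names introduced here, and it assumes that the series at each level are the unique combinations of the first k hierarchy columns.

```python
import pandas as pd


def count_series(df, hierarchy, include_root=True):
    """Count distinct series at every level of `hierarchy`.

    `hierarchy` is a list of column names ordered from the top level down.
    `include_root` adds 1 for the overall total ('World' in the question).
    Both names are illustrative assumptions, not an existing API.
    """
    total = 1 if include_root else 0
    for depth in range(1, len(hierarchy) + 1):
        # Unique combinations of the first `depth` hierarchy columns
        total += len(df[hierarchy[:depth]].drop_duplicates())
    return total


# Sample data copied from the question
rows = [
    ['Asia', 'India', 'A', 200],
    ['Asia', 'Thailand', 'A', 150],
    ['Asia', 'India', 'B', 175],
    ['Asia', 'Thailand', 'B', 225],
    ['Asia', 'Japan', 'D', 325],
    ['Europe', 'Italy', 'A', 120],
    ['Europe', 'Turkey', 'A', 130],
    ['Europe', 'Italy', 'B', 160],
]
df = pd.DataFrame(rows, columns=['Region', 'Country', 'Product', 'Sales'])

# 2 regions + 5 region/product pairs + 8 leaf series + 1 for 'World'
print(count_series(df, ['Region', 'Product', 'Country']))  # 16
# Same hierarchy without the 'World' series
print(count_series(df, ['Region', 'Product', 'Country'], include_root=False))  # 15
```

Because duplicates are dropped on the hierarchy columns only, repeated rows for the same series (for example one row per date) do not inflate the count.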
