John Taylor

Reputation: 737

pandas calculate total years from lists of years in multiple columns

Given this sample dataset:

id    col1                    col2                 col3
0    [1999, 2000, 2001]   [1989, 2000, 2005]   [1999, 2004, 2005]

I am trying to create a fourth column that contains the total number of distinct years across the lists in the other three columns. So, something like combining all those lists together and then taking len(set(combined_list)) to get, in this case, the following output:

id    col1                    col2                 col3               total_years
0    [1999, 2000, 2001]   [1989, 2000, 2005]   [1999, 2004, 2005]        6

I was able to produce separate columns counting the values in each list, but simply summing those per-list lengths gives the wrong value because of duplicate years across the lists.

The ideal operation, it seems to me, would be to combine all the lists into one, build a set from that, and take the length of the set. But I cannot figure out how that would work within pandas.
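In plain Python, for the single sample row above, I mean something like this:

col1 = [1999, 2000, 2001]
col2 = [1989, 2000, 2005]
col3 = [1999, 2004, 2005]

combined = col1 + col2 + col3      # one flat list of all the years
total_years = len(set(combined))   # set() drops the duplicate years
print(total_years)                 # 6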

One attempt that didn't work:

for col in df.columns[1:]:
    new_col = str(col) + "_length"
    df[new_col] = df[col].apply(len)                  # per-column list lengths
    df['new_col_name'] = df.iloc[:, 9:].sum(axis=1)   # sums those lengths, so duplicate years get counted twice

Second attempt, after combining all the col lists into a single list in one column: I tried to get the length of the set of that one column, but that didn't work either.

df['Total_years'] = len(set(total_lst_col))

Upvotes: 1

Views: 144

Answers (2)

ALollz

Reputation: 59519

stack all of the lists into a single column and then explode. Then you can groupby + nunique along the original index. This will work even if you have single years that aren't in a list.

import pandas as pd
df = pd.DataFrame({'id': ['A', 'B', 'C'],
                   'col1': [[1999, 2000, 2001], [1999, 2001], 2000],
                   'col2': [[1989, 2000, 2005], [1989, 2000, 2005], [2001, 2002]],
                   'col3': [[1999, 2004, 2005], [1999], [2005]]})

#.iloc[:, 1:] to ignore the `id` column
df['total_years'] = df.iloc[:, 1:].stack().explode().groupby(level=0).nunique()

  id                col1                col2                col3  total_years
0  A  [1999, 2000, 2001]  [1989, 2000, 2005]  [1999, 2004, 2005]            6
1  B        [1999, 2001]  [1989, 2000, 2005]              [1999]            5
2  C                2000        [2001, 2002]              [2005]            4

Here's an explanation:

The first stack step puts everything into a Series with a MultiIndex: the 0th level is the original DataFrame index, and the 1st level is the column each value belonged to (which we don't really care about).

df.iloc[:, 1:].stack()
0  col1    [1999, 2000, 2001]
   col2    [1989, 2000, 2005]
   col3    [1999, 2004, 2005]
1  col1          [1999, 2001]
   col2    [1989, 2000, 2005]
   col3                [1999]
2  col1                  2000
   col2          [2001, 2002]
   col3                [2005]

explode then expands those lists to a single row per value, and keeps non-list values the same. The index of the above Series is repeated for each element of the same list.

df.iloc[:, 1:].stack().explode()
0  col1    1999    # index `0` `col1` had 3 elements in
   col1    2000    # the list which get expanded to 
   col1    2001    # these three rows
   col2    1989
...
2  col1    2000    # This value wasn't a list, explode keeps as is
   col2    2001
   col2    2002
   col3    2005

Finally, .groupby(level=0) groups over that level, which is based on the original DataFrame index, and .nunique() counts the number of unique values in each group, which for this Series are the distinct years found in any of those columns.
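Running the whole chain on the example DataFrame gives one count per original row, which is exactly what gets assigned back as total_years:

df.iloc[:, 1:].stack().explode().groupby(level=0).nunique()
0    6
1    5
2    4
dtype: int64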

Upvotes: 4

azro

Reputation: 54148

You can use an apply on each row:

  • chain all the sublists into one iterable
  • set to keep the unique years
  • len to get the count

from itertools import chain

# .iloc[:, 1:] skips the `id` column, which is not a list of years
df['total_years'] = df.iloc[:, 1:].apply(lambda x: len(set(chain(*x))), axis=1)
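Note that chain(*x) assumes every cell holds a list; a bare scalar like the 2000 in the extended example above would raise TypeError: 'int' object is not iterable, so the explode-based approach is the safer choice for mixed data.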

Upvotes: 0
