Reputation: 737
Given this sample dataset:
id col1 col2 col3
0 [1999, 2000, 2001] [1989, 2000, 2005] [1999, 2004, 2005]
I am trying to create a fourth column that contains the total number of distinct years across the lists in the other three columns. In other words, something like combining all of those lists together and then taking `len(set(combined_list))` to get, in this case, the following output:
id col1 col2 col3 total_years
0 [1999, 2000, 2001] [1989, 2000, 2005] [1999, 2004, 2005] 6
I was able to produce separate columns counting the values in each list, but simply summing the `len` of every column's list gives the wrong value because of duplicate years across lists.
It seems to me the ideal operation would be to combine all the lists into one, build a set from that, and take the length of the set. But I cannot figure out how that would work within pandas.
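In plain Python, the operation I have in mind for a single row would be something like this sketch (hard-coding the sample row just to show the logic):

```python
# The three list cells from the sample row
row_lists = [[1999, 2000, 2001], [1989, 2000, 2005], [1999, 2004, 2005]]

# Flatten all the lists, deduplicate with a set, then count
combined = [year for lst in row_lists for year in lst]
total_years = len(set(combined))
print(total_years)  # 6
```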
One attempt that didn't work:
for col in df.columns[1:]:
    new_col = str(col) + "_length"
    df[new_col] = df[col].apply(len)

df['new_col_name'] = df.iloc[:, 9:].sum(axis=1)
Second attempt, after combining all the column lists into a single list in one column: I tried to take the length of the set of that one column, but that didn't work either.
df['Total_years'] = len(set(total_lst_col))
Upvotes: 1
Views: 144
Reputation: 59519
`stack` all of the lists into a single column and then `explode`. Then you can `groupby` + `nunique` along the original index. This will work even if you have single years that aren't in a list.
import pandas as pd
df = pd.DataFrame({'id': ['A', 'B', 'C'],
'col1': [[1999, 2000, 2001], [1999, 2001], 2000],
'col2': [[1989, 2000, 2005], [1989, 2000, 2005], [2001, 2002]],
'col3': [[1999, 2004, 2005], [1999], [2005]]})
#.iloc[:, 1:] to ignore the `id` column
df['total_years'] = df.iloc[:, 1:].stack().explode().groupby(level=0).nunique()
id col1 col2 col3 total_years
0 A [1999, 2000, 2001] [1989, 2000, 2005] [1999, 2004, 2005] 6
1 B [1999, 2001] [1989, 2000, 2005] [1999] 5
2 C 2000 [2001, 2002] [2005] 4
Here's an explanation:
The first `stack` step puts everything into a Series with a MultiIndex: the 0th level is the original DataFrame index, the 1st level is the column it belonged to (which we don't really care about).
df.iloc[:, 1:].stack()
0 col1 [1999, 2000, 2001]
col2 [1989, 2000, 2005]
col3 [1999, 2004, 2005]
1 col1 [1999, 2001]
col2 [1989, 2000, 2005]
col3 [1999]
2 col1 2000
col2 [2001, 2002]
col3 [2005]
`explode` then expands those lists to a single row per value, and keeps non-list values as they are. The index of the above Series is repeated for each element of the same list:
df.iloc[:, 1:].stack().explode()
0 col1 1999 # index `0` `col1` had 3 elements in
col1 2000 # the list which get expanded to
col1 2001 # these three rows
col2 1989
...
2 col1 2000 # This value wasn't a list, explode keeps as is
col2 2001
col2 2002
col3 2005
Finally, `.groupby(level=0)` groups over that level, which corresponds to the original DataFrame index, and `nunique` counts the number of unique values, which for this Series are the distinct years found in any of those columns.
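For illustration, running just that final step on the exploded Series (rebuilding the example `df` from above) yields one distinct-year count per original row:

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'],
                   'col1': [[1999, 2000, 2001], [1999, 2001], 2000],
                   'col2': [[1989, 2000, 2005], [1989, 2000, 2005], [2001, 2002]],
                   'col3': [[1999, 2004, 2005], [1999], [2005]]})

# Stack + explode to one year per row, then count distinct years per original index
s = df.iloc[:, 1:].stack().explode()
counts = s.groupby(level=0).nunique()
print(counts.tolist())  # [6, 5, 4]
```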
Upvotes: 4
Reputation: 54148
You can use an `apply` on each row: `chain` all sublists into one, `set` to keep the uniques, and `len` to get the count. (Select only the list columns, otherwise `chain` will choke on the `id` column.)

from itertools import chain
df['total_years'] = df[['col1', 'col2', 'col3']].apply(lambda x: len(set(chain(*x))), axis=1)
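The `chain(*x)` approach assumes every cell is a list (and that the `id` column is excluded from the `apply`). A sketch that also tolerates bare scalars like `2000`, using a hypothetical `as_list` helper, could look like this:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'],
                   'col1': [[1999, 2000, 2001], [1999, 2001], 2000],
                   'col2': [[1989, 2000, 2005], [1989, 2000, 2005], [2001, 2002]],
                   'col3': [[1999, 2004, 2005], [1999], [2005]]})

def as_list(v):
    # Wrap non-list cells so chain.from_iterable can flatten uniformly
    return v if isinstance(v, list) else [v]

df['total_years'] = df[['col1', 'col2', 'col3']].apply(
    lambda row: len(set(chain.from_iterable(as_list(v) for v in row))), axis=1)
print(df['total_years'].tolist())  # [6, 5, 4]
```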
Upvotes: 0