user308827
user308827

Reputation: 21971

creating new dataframe by combining different groups

          id      source                 Date       CVI
67092  10365     sat     2016-11-20 00:00:00      1.9
67093  10365     nel-2   2016-11-20 00:00:00      1.5
67094  10365     sat     2016-12-06 00:00:00   1.436964
67095  10365     sat     2016-12-22 00:00:00        NaN
67096  10365     sat     2017-01-07 00:00:00   4.835949
67097  10365     sat     2017-01-23 00:00:00  10.033607
...      ...         ...                  ...        ...
723    84650  nel-2      2020-04-03 00:00:00   5.085851
724    84650  nel-2      2020-04-08 00:00:00   4.375207
725    84650  nel-2      2020-04-13 00:00:00   0.040688
726    84650  nel-2      2020-04-18 00:00:00   3.151000

From the dataframe above, I would like to create a new dataframe where id for same Date and different source are in same row. e.g. the output should look like:

         id              Date                 sat_CVI   nel-2_CVI
0       10365        2016-11-20 00:00:00        1.9        1.5

I tried using groupby but it does not do what I want. Please note that the resulting table has only those rows where the same id has a CVI value for the same Date for both the source

Upvotes: 1

Views: 73

Answers (1)

jezrael
jezrael

Reputation: 862731

If there is no duplicated triples id, Date and source values use DataFrame.set_index with Series.unstack for reshape and then chnage columns names by DataFrame.add_suffix with remove rows with missing values by DataFrame.dropna:

df1 = (df.set_index(['id','Date','source'])['CVI']
         .unstack()
         .add_suffix('_CVI')
         .dropna()
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
      id                 Date  nel-2_CVI  sat_CVI
0  10365  2016-11-20 00:00:00        1.5      1.9

If possible duplicates first solution fail is necessary DataFrame.pivot_table (it also working for no duplicated triples, then is no aggregation) with some aggregate fucntion like sum, mean instead set_index with unstack:

print (df)
          id source                 Date        CVI
67092  10365    sat  2016-11-20 00:00:00   1.900000 <- duplicated triple
67092  10365    sat  2016-11-20 00:00:00   1.000000 <- duplicated triple
67093  10365  nel-2  2016-11-20 00:00:00   1.500000
67094  10365    sat  2016-12-06 00:00:00   1.436964
67095  10365    sat  2016-12-22 00:00:00        NaN
67096  10365    sat  2017-01-07 00:00:00   4.835949
67097  10365    sat  2017-01-23 00:00:00  10.033607
723    84650  nel-2  2020-04-03 00:00:00   5.085851
724    84650  nel-2  2020-04-08 00:00:00   4.375207
725    84650  nel-2  2020-04-13 00:00:00   0.040688
726    84650  nel-2  2020-04-18 00:00:00   3.151000

df1 = (df.pivot_table(index=['id','Date'],columns='source', values='CVI', aggfunc='sum')
         .add_suffix('_CVI')
         .dropna()
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
      id                 Date  nel-2_CVI  sat_CVI
0  10365  2016-11-20 00:00:00        1.5      2.9

Upvotes: 1

Related Questions