Jeff Pipas
Jeff Pipas

Reputation: 253

Creating a single dataframe out of two columns each containing lists

I have a file that looks like this:

Location Code   Trait ID    Effective Date
WAU1    23984,24896,27576   06/05/2014 ,06/05/2014 ,06/12/2014 
WAU2    126973,219332   06/05/2014 ,06/05/2014 
WAU3    24375   06/05/2014 
WAU4    23984   06/05/2014 
WAU5    5199,23984  NULL
WAU6    12342,224123    06/05/2014 

Notice how the 2nd and 3rd column are a "list" of values. Some rows contain exact matches in terms of number of elements in each list, others are missing, or not there at all (null). I need to create a single dataframe that is much like the following

   Location Code Trait ID  Effective Date
       0    WAU1    23984   06/05/2014
       1    WAU1    24896   06/05/2014
       2    WAU1    27576   06/12/2014
       3    WAU2    126973  06/05/2014
       4    WAU2    219332  06/05/2014
       5    WAU3    24375   06/05/2014
       6    WAU4    23984   06/05/2014
       7    WAU5    5199    NaN
       8    WAU5    23984   NaN
       9    WAU6    12342   06/05/2014
       10   WAU6    224123  NaN

I've been able to break each of the "list" columns into separate dataframes using the following:

df1 = df1['Trait ID'].str.split(',').apply(pd.Series,1).stack()
df1.index = df1.index.droplevel(-1)
df1.name = 'Trait ID'
del df1['Trait ID']
df1 = df1.join(trait_id)

Which gives me something like:

  Location Code Trait ID
0          WAU1    23984
0          WAU1    24896
0          WAU1    27576
1          WAU2   126973
1          WAU2   219332
2          WAU3    24375
3          WAU4    23984
4          WAU5     5199
4          WAU5    23984
5          WAU6    12342
5          WAU6   224123

And I can create another dataframe with the "Effective Date" list using the same logic above to produce the following:

  Location Code Effective Date
0          WAU1    06/05/2014 
0          WAU1    06/05/2014 
0          WAU1    06/12/2014 
1          WAU2    06/05/2014 
1          WAU2    06/05/2014 
2          WAU3    06/05/2014 
3          WAU4    06/05/2014 
4          WAU5            NaN
5          WAU6    06/05/2014 

I'm struggling to find the proper "function" in pandas (e.g join, merge, concat) to combine the two dataframes into my desired output. Although I have a feeling its a combination of them, with a reset_index() in there somewhere.

Upvotes: 1

Views: 375

Answers (1)

Stefan
Stefan

Reputation: 42875

Starting with:

  Location Code             Trait ID                    Effective Date
0          WAU1  23984, 24896, 27576  06/05/2014,06/05/2014,06/12/2014
1          WAU2       126973, 219332             06/05/2014,06/05/2014
2          WAU3                24375               2014-06-05 00:00:00
3          WAU4                23984               2014-06-05 00:00:00
4          WAU5          5199, 23984                               NaN
5          WAU6        12342, 224123               2014-06-05 00:00:00

You could groupby('Location Code'), use str.split(',') withexpand=True, pivot the result usingstack()andconcat` for each group:

df1.groupby('Location Code').apply(lambda x: pd.concat([x['Trait ID'].str.split(',', expand=True).stack(), x['Effective Date'].str.split(',', expand=True).stack()], axis=1)).reset_index([1, 2], drop=True)

to get:

                     0                    1
Location Code                              
WAU1             23984           06/05/2014
WAU1             24896           06/05/2014
WAU1             27576           06/12/2014
WAU2            126973           06/05/2014
WAU2            219332           06/05/2014
WAU3             24375  2014-06-05 00:00:00
WAU4             23984  2014-06-05 00:00:00
WAU5              5199                  nan
WAU5             23984                  NaN
WAU6             12342  2014-06-05 00:00:00
WAU6            224123                  NaN

Upvotes: 1

Related Questions