Reputation: 253
I have a file that looks like this:
Location Code Trait ID Effective Date
WAU1 23984,24896,27576 06/05/2014 ,06/05/2014 ,06/12/2014
WAU2 126973,219332 06/05/2014 ,06/05/2014
WAU3 24375 06/05/2014
WAU4 23984 06/05/2014
WAU5 5199,23984 NULL
WAU6 12342,224123 06/05/2014
Notice how the 2nd and 3rd column are a "list" of values. Some rows contain exact matches in terms of number of elements in each list, others are missing, or not there at all (null). I need to create a single dataframe that is much like the following
Location Code Trait ID Effective Date
0 WAU1 23984 06/05/2014
1 WAU1 24896 06/05/2014
2 WAU1 27576 06/12/2014
3 WAU2 126973 06/05/2014
4 WAU2 219332 06/05/2014
5 WAU3 24375 06/05/2014
6 WAU4 23984 06/05/2014
7 WAU5 5199 NaN
8 WAU5 23984 NaN
9 WAU6 12342 06/05/2014
10 WAU6 224123 NaN
I've been able to break each of the "list" columns into separate dataframes using the following:
df1 = df1['Trait ID'].str.split(',').apply(pd.Series,1).stack()
df1.index = df1.index.droplevel(-1)
df1.name = 'Trait ID'
del df1['Trait ID']
df1 = df1.join(trait_id)
Which gives me something like:
Location Code Trait ID
0 WAU1 23984
0 WAU1 24896
0 WAU1 27576
1 WAU2 126973
1 WAU2 219332
2 WAU3 24375
3 WAU4 23984
4 WAU5 5199
4 WAU5 23984
5 WAU6 12342
5 WAU6 224123
And I can create another dataframe with the "Effective Date" list using the same logic above to produce the following:
Location Code Effective Date
0 WAU1 06/05/2014
0 WAU1 06/05/2014
0 WAU1 06/12/2014
1 WAU2 06/05/2014
1 WAU2 06/05/2014
2 WAU3 06/05/2014
3 WAU4 06/05/2014
4 WAU5 NaN
5 WAU6 06/05/2014
I'm struggling to find the proper "function" in pandas (e.g join, merge, concat) to combine the two dataframes into my desired output. Although I have a feeling its a combination of them, with a reset_index() in there somewhere.
Upvotes: 1
Views: 375
Reputation: 42875
Starting with:
Location Code Trait ID Effective Date
0 WAU1 23984, 24896, 27576 06/05/2014,06/05/2014,06/12/2014
1 WAU2 126973, 219332 06/05/2014,06/05/2014
2 WAU3 24375 2014-06-05 00:00:00
3 WAU4 23984 2014-06-05 00:00:00
4 WAU5 5199, 23984 NaN
5 WAU6 12342, 224123 2014-06-05 00:00:00
You could groupby('Location Code')
, use str.split(',') with
expand=True, pivot the result using
stack()and
concat` for each group:
df1.groupby('Location Code').apply(lambda x: pd.concat([x['Trait ID'].str.split(',', expand=True).stack(), x['Effective Date'].str.split(',', expand=True).stack()], axis=1)).reset_index([1, 2], drop=True)
to get:
0 1
Location Code
WAU1 23984 06/05/2014
WAU1 24896 06/05/2014
WAU1 27576 06/12/2014
WAU2 126973 06/05/2014
WAU2 219332 06/05/2014
WAU3 24375 2014-06-05 00:00:00
WAU4 23984 2014-06-05 00:00:00
WAU5 5199 nan
WAU5 23984 NaN
WAU6 12342 2014-06-05 00:00:00
WAU6 224123 NaN
Upvotes: 1