Reputation: 109
I have a question regarding duplicating rows in a pandas dataframe. I have allocated relevant dates to each observation in the column "relevant shocks" in lists. Observation 22 has an empty list, 23 a list with one date, 24 a list with two dates and 25 a list with three dates (as seen in column "listlength").
My aim is to expand the dataframe in the way that observations with empty lists remain in the dataset with one row while rows with x observations get duplicated x times - as such, rows 22 and 23 should stay in the dataframe once (22 despite the empty list and 23 because it has one relevant date), row 24 should get duplicated once and thus be in the dataframe twice and observation 25 should be duplicated twice and thus be in the dataframe thrice. As such, each row should be in the dataframe as many times as it has relevant shocks (as measured by listlength). Except for the ones with list length 0, they should still remain in the dataframe.
Further, I want to create a new column "relevant shock" which is filled by each of the relevant shocks once and separately.
This is the current dataframe:
quarter year pddate relevant shocks listlength
22 1 2012 2012-02-15 [] 0.0
23 4 2011 2011-11-15 [2011-08-18 00:00:00] 1.0
24 3 2011 2011-08-15 [2011-08-18 00:00:00, 2011-09-22 00:00:00] 2.0
25 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00 3.0
The new dataframe should have 7 rows and look as follows:
quarter year pddate relevant shocks listlength relevant shock
22 1 2012 2012-02-15 [] 0.0
23 4 2011 2011-11-15 [2011-08-18 00:00:00] 1.0 2011-08-18 00:00:00
24 3 2011 2011-08-15 [2011-08-18 00:00:00, 2011-09-22 00:00:00] 2.0 2011-08-18 00:00:00
25 3 2011 2011-08-15 [2011-08-18 00:00:00, 2011-09-22 00:00:00] 2.0 2011-09-22 00:00:00
26 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00 3.0 2011-08-04 00:00:00
27 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00 3.0 2011-08-08 00:00:00
28 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00 3.0 2011-08-10 00:00:00
So the basic idea would be to add the new column "relevant shock", go through each row, keep it unchanged if it has an empty list in "relevant shocks", also keep it unchanged if it has one date in "relevant shocks", but fill the new column "relevant shock" with that one list entry, duplicate it if it has two list entries in "relevant shocks" and fill the column "relevant shock" in each row with one of the two list entries, respectively, and so on.
Is this possible with Python?
Upvotes: 3
Views: 5466
Reputation: 29635
EDIT for pandas version >= 0.25, a new method explode
would do the job really easily:
#first create a copy of the column
df['relevant shock'] = df['relevant shocks']
#explode the new column
df = df.explode('relevant shock').fillna('')
print (df)
#same result than the one below
Old answer
From the column 'relevant shocks' you can use apply
, pd.Series
and stack
to create a row for each date, such as:
df['relevant shocks'].apply(pd.Series).stack()
Out[448]:
23 0 2011-08-18 00:00:00
24 0 2011-08-18 00:00:00
1 2011-09-22 00:00:00
25 0 2011-08-04 00:00:00
1 2011-08-08 00:00:00
2 2011-08-10 00:00:00
dtype: object
I know the one empty is missing but after you join
the result to your df
with a reset_index
, fillna
and drop
the extra column. With a df
like this:
df = pd.DataFrame({'quarter':[1,2,3,4],
'relevant shocks':[[],['2011-08-18 00:00:00'],
['2011-08-18 00:00:00', '2011-09-22 00:00:00'],
['2011-08-04 00:00:00', '2011-08-08 00:00:00', '2011-08-10 00:00:00']]},
index=[22,23,24,25])
then you do:
df = (df.join(df['relevant shocks'].apply(pd.Series).stack()
.reset_index(1,name='relevant shock'))
.fillna('').drop('level_1',1))
and you get:
quarter relevant shocks \
22 1 []
23 2 [2011-08-18 00:00:00]
24 3 [2011-08-18 00:00:00, 2011-09-22 00:00:00]
24 3 [2011-08-18 00:00:00, 2011-09-22 00:00:00]
25 4 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...
25 4 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...
25 4 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...
relevant shock
22
23 2011-08-18 00:00:00
24 2011-08-18 00:00:00
24 2011-09-22 00:00:00
25 2011-08-04 00:00:00
25 2011-08-08 00:00:00
25 2011-08-10 00:00:00
EDIT: it seems that for the real data, an error occured with empty list, so to solve it and reset_index
at the end:
df = (df.join(df.loc[df['relevant shocks'].str.len() > 0, 'relevant shocks']
.apply(pd.Series).stack().reset_index(1,name='relevant shock'))
.fillna('').drop('level_1',1).reset_index(drop=True))
Upvotes: 8