Julia
Julia

Reputation: 109

Duplicate rows in pandas dataframe based on list and fill new column with list entries

I have a question regarding duplicating rows in a pandas dataframe. I have allocated relevant dates to each observation in the column "relevant shocks" in lists. Observation 22 has an empty list, 23 a list with one date, 24 a list with two dates and 25 a list with three dates (as seen in column "listlength").

My aim is to expand the dataframe in the way that observations with empty lists remain in the dataset with one row while rows with x observations get duplicated x times - as such, rows 22 and 23 should stay in the dataframe once (22 despite the empty list and 23 because it has one relevant date), row 24 should get duplicated once and thus be in the dataframe twice and observation 25 should be duplicated twice and thus be in the dataframe thrice. As such, each row should be in the dataframe as many times as it has relevant shocks (as measured by listlength). Except for the ones with list length 0, they should still remain in the dataframe.

Further, I want to create a new column "relevant shock" which is filled by each of the relevant shocks once and separately.

This is the current dataframe:

    quarter year    pddate      relevant shocks                                                 listlength
22  1       2012    2012-02-15  []                                                              0.0
23  4       2011    2011-11-15  [2011-08-18 00:00:00]                                           1.0
24  3       2011    2011-08-15  [2011-08-18 00:00:00, 2011-09-22 00:00:00]                      2.0
25  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0

The new dataframe should have 7 rows and look as follows:

    quarter year    pddate      relevant shocks                                                 listlength    relevant shock
22  1       2012    2012-02-15  []                                                              0.0
23  4       2011    2011-11-15  [2011-08-18 00:00:00]                                           1.0           2011-08-18 00:00:00
24  3       2011    2011-08-15  [2011-08-18 00:00:00, 2011-09-22 00:00:00]                      2.0           2011-08-18 00:00:00
25  3       2011    2011-08-15  [2011-08-18 00:00:00, 2011-09-22 00:00:00]                      2.0           2011-09-22 00:00:00
26  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0           2011-08-04 00:00:00
27  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0           2011-08-08 00:00:00
28  2       2011    2011-05-13  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00  3.0           2011-08-10 00:00:00

So the basic idea would be to add the new column "relevant shock", go through each row, keep it unchanged if it has an empty list in "relevant shocks", also keep it unchanged if it has one date in "relevant shocks", but fill the new column "relevant shock" with that one list entry, duplicate it if it has two list entries in "relevant shocks" and fill the column "relevant shock" in each row with one of the two list entries, respectively, and so on.

Is this possible with Python?

Upvotes: 3

Views: 5466

Answers (2)

Ben.T
Ben.T

Reputation: 29635

EDIT for pandas version >= 0.25, a new method explode would do the job really easily:

#first create a copy of the column
df['relevant shock'] = df['relevant shocks']
#explode the new column
df = df.explode('relevant shock').fillna('')
print (df)
#same result than the one below

Old answer

From the column 'relevant shocks' you can use apply, pd.Series and stack to create a row for each date, such as:

df['relevant shocks'].apply(pd.Series).stack()
Out[448]: 
23  0    2011-08-18 00:00:00
24  0    2011-08-18 00:00:00
    1    2011-09-22 00:00:00
25  0    2011-08-04 00:00:00
    1    2011-08-08 00:00:00
    2    2011-08-10 00:00:00
dtype: object

I know the one empty is missing but after you join the result to your df with a reset_index, fillna and drop the extra column. With a df like this:

df = pd.DataFrame({'quarter':[1,2,3,4],
                   'relevant shocks':[[],['2011-08-18 00:00:00'],
                      ['2011-08-18 00:00:00', '2011-09-22 00:00:00'],
                      ['2011-08-04 00:00:00', '2011-08-08 00:00:00', '2011-08-10 00:00:00']]},
                   index=[22,23,24,25])

then you do:

df = (df.join(df['relevant shocks'].apply(pd.Series).stack()
                                   .reset_index(1,name='relevant shock'))
         .fillna('').drop('level_1',1))

and you get:

    quarter                                    relevant shocks  \
22        1                                                 []   
23        2                              [2011-08-18 00:00:00]   
24        3         [2011-08-18 00:00:00, 2011-09-22 00:00:00]   
24        3         [2011-08-18 00:00:00, 2011-09-22 00:00:00]   
25        4  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...   
25        4  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...   
25        4  [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...   

         relevant shock  
22                       
23  2011-08-18 00:00:00  
24  2011-08-18 00:00:00  
24  2011-09-22 00:00:00  
25  2011-08-04 00:00:00  
25  2011-08-08 00:00:00  
25  2011-08-10 00:00:00  

EDIT: it seems that for the real data, an error occured with empty list, so to solve it and reset_index at the end:

df = (df.join(df.loc[df['relevant shocks'].str.len() > 0, 'relevant shocks']
                .apply(pd.Series).stack().reset_index(1,name='relevant shock'))
        .fillna('').drop('level_1',1).reset_index(drop=True))

Upvotes: 8

Vishma Dias
Vishma Dias

Reputation: 700

Now can use pandas.DataFrame.explode

Upvotes: 0

Related Questions