Souvik Ray
Souvik Ray

Reputation: 3018

Split a column containing a list into multiple rows in Pandas based on a condition

I have a data which looks like below

[(datetime.datetime(2021, 2, 6, 8, 18, 1, 212763),u'[["London", "New York", "BUSY"]]'), 
(datetime.datetime(2021, 2, 6, 8, 17, 1, 18633), u'[["Mumbai", "Tokyo", "IDLE"]]'), (datetime.datetime(2021, 2, 6, 8, 16, 1, 182888), u'[["Amsterdam", "Chicago", "IDLE"], ["Amsterdam", "London", "IDLE"], ["Amsterdam", "Berlin", "BUSY"]]'), (datetime.datetime(2021, 2, 6, 8, 15, 1, 245619), u'[["Tokyo", "Moscow", "IDLE"]]'), (datetime.datetime(2021, 2, 6, 7, 18, 1, 413066), u'[["Mumbai", "Los Angeles", "IDLE"], ["Mumbai", "Berlin", "IDLE"]]'), 
(datetime.datetime(2021, 2, 6, 7, 17, 1, 154138), u'[]'), 
(datetime.datetime(2021, 2, 6, 7, 16, 1, 253111), u'[]')]

It has two columns first a date and second a string list of list.

This is how it looks on loading it in pandas

                        date                                             status
0 2021-02-06 08:18:01.212763                   [["London", "New York", "BUSY"]]
1 2021-02-06 08:17:01.018633                      [["Mumbai", "Tokyo", "IDLE"]]
2 2021-02-06 08:16:01.182888  [["Amsterdam", "Chicago", "IDLE"], ["Amsterdam...
3 2021-02-06 08:15:01.245619                      [["Tokyo", "Moscow", "IDLE"]]
4 2021-02-06 07:18:01.413066  [["Mumbai", "Los Angeles", "IDLE"], ["Mumabi",...
5 2021-02-06 07:17:01.154138                                                 []
6 2021-02-06 07:16:01.253111                                                 []

There are two issues.

First I need to convert the string list of list into a regular list and this is achieved by doing something like this

df[column].apply(literal_eval)

The second issue is some column values contains multiple items in the list and I need to split each of them and make a new row containing that value as an independent list. Also each column value should be converted into a list rather than a list of list.

For example I have this particular column value

(datetime.datetime(2021, 2, 6, 8, 16, 1, 182888), u'[["Amsterdam", "Chicago", "IDLE"], ["Amsterdam", "Londom", "IDLE"], ["Amsterdam", "Berlin", "BUSY"]]')

Here each item in the list should form a new row containing that value in the column as a list. Something like below

2021-02-06 08:16:01.182888 ["Amsterdam", "Chicago", "IDLE"]
2021-02-06 08:16:01.182888 ["Amsterdam", "Londom", "IDLE"]
2021-02-06 08:16:01.182888 ["Amsterdam", "Berlin", "BUSY"]

Any other columns that have just a single item in the list like

(datetime.datetime(2021, 2, 6, 8, 18, 1, 212763),u'[["London", "New York", "BUSY"]]')

This should be converted to

2021-02-06 08:18:01.212763 ["London", "New York", "BUSY"]

So the final Dataframe should look like

                        date                                             status
0 2021-02-06 08:18:01.212763                   ["London", "New York", "BUSY"]
1 2021-02-06 08:17:01.018633                   ["Mumbai", "Tokyo", "IDLE"]
2 2021-02-06 08:16:01.182888                   ["Amsterdam", "Chicago", "IDLE"]
3 2021-02-06 08:16:01.182888                   ["Amsterdam", "London", "IDLE"]
4 2021-02-06 08:16:01.182888                   ["Amsterdam", "Berlin", "BUSY"]
5 2021-02-06 08:15:01.245619                   ["Tokyo", "Moscow", "IDLE"]
6 2021-02-06 07:18:01.413066                   ["Mumbai", "Los Angeles", "IDLE"]
7 2021-02-06 07:18:01.413066                   ["Mumbai", "Berlin", "IDLE"]
8 2021-02-06 07:17:01.154138                                                 []
9 2021-02-06 07:16:01.253111                                                 []

This is what I have done so far

import datetime
import pandas as pd
import json
from ast import literal_eval

df = pd.DataFrame(data)
df.columns = ["date", "status"]
df = df[df["status"] != '[]'] # remove empty lists
df['status'] = df['status'].apply(literal_eval) # convert string list of list into regular list

How do I perform the next set of operations as mentioned above?

Upvotes: 2

Views: 551

Answers (1)

Shubham Sharma
Shubham Sharma

Reputation: 71689

After using literal_eval to evaluate the strings in column status as python lists you can use:

For pandas version >= 0.25 you can use explode:

# Explode dataframe
df_out = df.explode('status').reset_index(drop=True)

# fill the NaN with empty lists
df_out['status'] = df_out['status'].dropna().reindex(df_out.index, fill_value=[])

For pandas version < 0.25 as explode is not available you can use replicate the explode like behaviour using index.repeat, then flatening out the nested lists using chain:

from itertools import chain

l = df['status'].str.len()
m = l > 0

df_out = df.reindex(df[m].index.repeat(l[m]))
df_out['status'] = list(chain(*df.loc[m, 'status']))
df_out = df_out.append(df[~m]).sort_index().reset_index(drop=True)

>>> df_out

                        date                       status
0 2021-02-06 08:18:01.212763     [London, New York, BUSY]
1 2021-02-06 08:17:01.018633        [Mumbai, Tokyo, IDLE]
2 2021-02-06 08:16:01.182888   [Amsterdam, Chicago, IDLE]
3 2021-02-06 08:16:01.182888    [Amsterdam, London, IDLE]
4 2021-02-06 08:16:01.182888    [Amsterdam, Berlin, BUSY]
5 2021-02-06 08:15:01.245619        [Tokyo, Moscow, IDLE]
6 2021-02-06 07:18:01.413066  [Mumbai, Los Angeles, IDLE]
7 2021-02-06 07:18:01.413066       [Mumbai, Berlin, IDLE]
8 2021-02-06 07:17:01.154138                           []
9 2021-02-06 07:16:01.253111                           []

Upvotes: 3

Related Questions