Reputation: 3018
I have a data which looks like below
[(datetime.datetime(2021, 2, 6, 8, 18, 1, 212763),u'[["London", "New York", "BUSY"]]'),
(datetime.datetime(2021, 2, 6, 8, 17, 1, 18633), u'[["Mumbai", "Tokyo", "IDLE"]]'), (datetime.datetime(2021, 2, 6, 8, 16, 1, 182888), u'[["Amsterdam", "Chicago", "IDLE"], ["Amsterdam", "London", "IDLE"], ["Amsterdam", "Berlin", "BUSY"]]'), (datetime.datetime(2021, 2, 6, 8, 15, 1, 245619), u'[["Tokyo", "Moscow", "IDLE"]]'), (datetime.datetime(2021, 2, 6, 7, 18, 1, 413066), u'[["Mumbai", "Los Angeles", "IDLE"], ["Mumbai", "Berlin", "IDLE"]]'),
(datetime.datetime(2021, 2, 6, 7, 17, 1, 154138), u'[]'),
(datetime.datetime(2021, 2, 6, 7, 16, 1, 253111), u'[]')]
It has two columns first a date
and second a string list of list
.
This is how it looks on loading it in pandas
date status
0 2021-02-06 08:18:01.212763 [["London", "New York", "BUSY"]]
1 2021-02-06 08:17:01.018633 [["Mumbai", "Tokyo", "IDLE"]]
2 2021-02-06 08:16:01.182888 [["Amsterdam", "Chicago", "IDLE"], ["Amsterdam...
3 2021-02-06 08:15:01.245619 [["Tokyo", "Moscow", "IDLE"]]
4 2021-02-06 07:18:01.413066 [["Mumbai", "Los Angeles", "IDLE"], ["Mumabi",...
5 2021-02-06 07:17:01.154138 []
6 2021-02-06 07:16:01.253111 []
There are two issues.
First I need to convert the string list of list
into a regular list
and this is achieved by doing something like this
df[column].apply(literal_eval)
The second issue is some column values contains multiple items in the list
and I need to split each of them and make a new row containing that value as an independent list
. Also each column value should be converted into a list
rather than a list of list
.
For example I have this particular column value
(datetime.datetime(2021, 2, 6, 8, 16, 1, 182888), u'[["Amsterdam", "Chicago", "IDLE"], ["Amsterdam", "Londom", "IDLE"], ["Amsterdam", "Berlin", "BUSY"]]')
Here each item in the list
should form a new row containing that value in the column as a list
. Something like below
2021-02-06 08:16:01.182888 ["Amsterdam", "Chicago", "IDLE"]
2021-02-06 08:16:01.182888 ["Amsterdam", "Londom", "IDLE"]
2021-02-06 08:16:01.182888 ["Amsterdam", "Berlin", "BUSY"]
Any other columns that have just a single item in the list
like
(datetime.datetime(2021, 2, 6, 8, 18, 1, 212763),u'[["London", "New York", "BUSY"]]')
This should be converted to
2021-02-06 08:18:01.212763 ["London", "New York", "BUSY"]
So the final Dataframe should look like
date status
0 2021-02-06 08:18:01.212763 ["London", "New York", "BUSY"]
1 2021-02-06 08:17:01.018633 ["Mumbai", "Tokyo", "IDLE"]
2 2021-02-06 08:16:01.182888 ["Amsterdam", "Chicago", "IDLE"]
3 2021-02-06 08:16:01.182888 ["Amsterdam", "London", "IDLE"]
4 2021-02-06 08:16:01.182888 ["Amsterdam", "Berlin", "BUSY"]
5 2021-02-06 08:15:01.245619 ["Tokyo", "Moscow", "IDLE"]
6 2021-02-06 07:18:01.413066 ["Mumbai", "Los Angeles", "IDLE"]
7 2021-02-06 07:18:01.413066 ["Mumbai", "Berlin", "IDLE"]
8 2021-02-06 07:17:01.154138 []
9 2021-02-06 07:16:01.253111 []
This is what I have done so far
import datetime
import pandas as pd
import json
from ast import literal_eval
df = pd.DataFrame(data)
df.columns = ["date", "status"]
df = df[df["status"] != '[]'] # remove empty lists
df['status'] = df['status'].apply(literal_eval) # convert string list of list into regular list
How do I perform the next set of operations as mentioned above?
Upvotes: 2
Views: 551
Reputation: 71689
After using literal_eval
to evaluate the strings in column status
as python lists you can use:
For pandas version >= 0.25
you can use explode
:
# Explode dataframe
df_out = df.explode('status').reset_index(drop=True)
# fill the NaN with empty lists
df_out['status'] = df_out['status'].dropna().reindex(df_out.index, fill_value=[])
For pandas version < 0.25
as explode
is not available you can use replicate the explode like behaviour using index.repeat
, then flatening out the nested lists using chain
:
from itertools import chain
l = df['status'].str.len()
m = l > 0
df_out = df.reindex(df[m].index.repeat(l[m]))
df_out['status'] = list(chain(*df.loc[m, 'status']))
df_out = df_out.append(df[~m]).sort_index().reset_index(drop=True)
>>> df_out
date status
0 2021-02-06 08:18:01.212763 [London, New York, BUSY]
1 2021-02-06 08:17:01.018633 [Mumbai, Tokyo, IDLE]
2 2021-02-06 08:16:01.182888 [Amsterdam, Chicago, IDLE]
3 2021-02-06 08:16:01.182888 [Amsterdam, London, IDLE]
4 2021-02-06 08:16:01.182888 [Amsterdam, Berlin, BUSY]
5 2021-02-06 08:15:01.245619 [Tokyo, Moscow, IDLE]
6 2021-02-06 07:18:01.413066 [Mumbai, Los Angeles, IDLE]
7 2021-02-06 07:18:01.413066 [Mumbai, Berlin, IDLE]
8 2021-02-06 07:17:01.154138 []
9 2021-02-06 07:16:01.253111 []
Upvotes: 3