Reputation: 4799
I have pandas code which receives data from a REST API as JSON. Depending on the request, the data can be an empty array, or some of the values may be missing (null values are not sent in those JSON packets).
I try to write pandas code that handles this gracefully, without having to test specifically for emptiness. I currently find it difficult to make this robust. For example, the code below runs fine except for one line:
import json
from unittest import TestCase

import dateutil.parser
import numpy as np
import pandas as pd
import pytz


class TestClient(TestCase):
    def test(self):
        # I get data from a JSON API, but it can be empty (or missing some values if those are null)
        json_dict = json.loads("[]")
        dfo = pd.DataFrame.from_dict(json_dict)
        # in case the JSON is partially empty, create the required columns
        for col in ("from", "to", "value"):
            if col not in dfo.columns:
                dfo[col] = np.nan
        # remove some duplicates - works fine
        dfo.sort_values(by=['from', 'to'], inplace=True)
        dfo.drop_duplicates(subset='from', keep='last', inplace=True)
        # parse timestamps - works fine
        dfo['from'] = dfo['from'].apply(dateutil.parser.parse)
        dfo['to'] = dfo['to'].apply(dateutil.parser.parse)
        # localize timestamps - works fine
        local_tz = pytz.timezone("Europe/Zurich")
        dfo['from_local'] = dfo['from'].apply(lambda dt: dt.astimezone(local_tz))
        dfo['to_local'] = dfo['to'].apply(lambda dt: dt.astimezone(local_tz))
        # some more datetime maths - works fine
        dfo['duration'] = dfo['to'] - dfo['from']
        # extract the date - fails
        dfo['to_date'] = dfo['to_local'].dt.date  # fails with AttributeError: Can only use .dt accessor with datetimelike values
        # But I could use the code below instead, which does the same thing, and works:
        # dfo['to_date'] = dfo['to_local'].apply(lambda r: r.date())
        # calculate some mean - works fine
        the_mean = dfo['value'].mean()  # OK, returns NaN
Can you recommend ways to handle possibly empty dataframes in a robust way? Are there best practices?
In the code above, could I declare data types to avoid the AttributeError?
Is my expectation wrong that the same processing should run just as well on an empty dataframe? (Or must you really imagine and test all possible corner cases?)
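For reference, one idea I am wondering about (I am not sure it is best practice) is declaring the dtypes up front when creating the missing columns, instead of filling them with np.nan. A minimal sketch of what I mean, using an empty frame to simulate the empty JSON case:

```python
import pandas as pd

dfo = pd.DataFrame()  # simulating the empty-JSON case

# create missing columns with an explicit dtype instead of np.nan
for col in ("from", "to"):
    if col not in dfo.columns:
        dfo[col] = pd.Series(dtype="datetime64[ns]")
if "value" not in dfo.columns:
    dfo["value"] = pd.Series(dtype="float64")

# 'from' and 'to' are now datetimelike, so .dt works even on the empty frame
print(dfo.dtypes)
print(dfo["to"].dt.date)
```

Would something along these lines be the recommended approach?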
Upvotes: 0
Views: 1226
Reputation: 30589
The problem is that empty columns in a newly created dataframe are of type float64, which is not datetimelike. So the easiest way is to explicitly convert every column you are going to use a .dt accessor on to a datetime type:
dfo['to_local'] = pd.to_datetime(dfo['to_local'])
You need to do this only once, e.g. after creation. If you later drop all rows from a dataframe and it becomes empty, it will nevertheless keep its column types.
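To illustrate (a minimal sketch, reduced to the dtype issue only): assigning np.nan to a column of an empty frame produces float64, and pd.to_datetime turns it into a datetimelike column on which .dt works:

```python
import numpy as np
import pandas as pd

# columns filled with np.nan on an empty frame default to float64
dfo = pd.DataFrame(columns=["from", "to"])
dfo["to_local"] = np.nan
print(dfo["to_local"].dtype)  # float64 -> .dt would raise AttributeError

# explicit conversion yields a datetimelike column even when empty
dfo["to_local"] = pd.to_datetime(dfo["to_local"])
print(dfo["to_local"].dtype)  # datetime64[ns]
print(dfo["to_local"].dt.date)  # now works, returns an empty Series
```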
Upvotes: 1