Philipp

Reputation: 4799

How to handle empty pandas dataframes gracefully

I have pandas code which receives data from a REST API in JSON. Depending on the request, the data can be an empty array or some of the values may be missing (null values are not sent in those JSON packets).

I'm trying to write pandas code that handles this gracefully, without having to test specifically for emptiness, but I'm finding it hard to make it robust. For example, the code below runs fine except for one line:

import json
from unittest import TestCase

import dateutil.parser
import numpy as np
import pandas as pd
import pytz


class TestClient(TestCase):
    def test(self):

        # I get data from a JSON API, but it can be empty (or missing some values if those are null)        
        json_dict = json.loads("[]")
        dfo = pd.DataFrame.from_dict(json_dict)
        # in case the json is partially empty create required columns
        for col in ("from", "to", "value"):
            if col not in dfo.columns:
                dfo[col] = np.nan

        # remove some duplicates - works fine
        dfo.sort_values(by=['from', 'to'], inplace=True)
        dfo.drop_duplicates(subset='from', keep='last', inplace=True)

        # parse timestamps - works fine
        dfo['from'] = dfo['from'].apply(dateutil.parser.parse)
        dfo['to'] = dfo['to'].apply(dateutil.parser.parse)

        # localize timestamps - works fine
        local_tz = pytz.timezone("Europe/Zurich")
        dfo['from_local'] = dfo['from'].apply(lambda dt: dt.astimezone(local_tz))
        dfo['to_local'] = dfo['to'].apply(lambda dt: dt.astimezone(local_tz))

        # some more datetime maths - works fine
        dfo['duration'] = dfo['to'] - dfo['from']

        # extract the date - fails
        dfo['to_date'] = dfo['to_local'].dt.date # fails with AttributeError: Can only use .dt accessor with datetimelike values
        # But I could use the code below instead, which does the same thing, and works
        # dfo['to_date'] = dfo['to_local'].apply(lambda r: r.date())

        # calculate some mean - works fine 
        the_mean = dfo['value'].mean() # OK, returns NaN

Can you recommend ways to handle possibly empty dataframes in a robust way? Are there best practices?

In the code above, could I declare data types to avoid the AttributeError?

Or is my expectation wrong that the same processing should also run on an empty dataframe (and you really have to imagine and test every possible corner case)?

Upvotes: 0

Views: 1226

Answers (1)

Stef

Reputation: 30589

The problem is that empty columns in a newly created dataframe are of type float64, which is not datetimelike.
So the easiest way is to explicitly convert every column you are going to use the .dt accessor on to datetime type:

dfo['to_local'] = pd.to_datetime(dfo['to_local'])
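
For illustration, here is a quick check on a freshly created empty frame (a minimal reproduction, not the full code from the question) showing the dtype before and after the conversion:

import numpy as np
import pandas as pd

dfo = pd.DataFrame()                      # empty frame, as from_dict([]) produces
dfo['to_local'] = np.nan                  # newly created empty column -> float64
print(dfo['to_local'].dtype)              # float64, not datetimelike
# dfo['to_local'].dt.date                 # would raise the AttributeError

dfo['to_local'] = pd.to_datetime(dfo['to_local'])
print(dfo['to_local'].dtype)              # datetime64[ns]
print(dfo['to_local'].dt.date)            # works, returns an empty Series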

You need to do this only once, e.g. right after creation. If you later drop all rows from the dataframe and it becomes empty, it will nevertheless keep its column types.
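
Applied to the pipeline from the question, a minimal sketch could convert both timestamp columns right after the column-creation loop. Note that utc=True is my assumption about the API's timestamp format, and that this replaces the row-wise dateutil parsing with the vectorised pd.to_datetime; adjust as needed:

import json

import numpy as np
import pandas as pd
import pytz

json_dict = json.loads("[]")                  # empty payload from the API
dfo = pd.DataFrame.from_dict(json_dict)

# create the required columns if the JSON did not contain them
for col in ("from", "to", "value"):
    if col not in dfo.columns:
        dfo[col] = np.nan

# convert once, right after creation; pd.to_datetime also accepts empty
# columns and gives them a datetimelike dtype (utc=True assumes the API
# sends UTC or offset-aware ISO timestamps)
dfo['from'] = pd.to_datetime(dfo['from'], utc=True)
dfo['to'] = pd.to_datetime(dfo['to'], utc=True)

# the rest of the pipeline now also runs on an empty frame
dfo.sort_values(by=['from', 'to'], inplace=True)
dfo.drop_duplicates(subset='from', keep='last', inplace=True)

local_tz = pytz.timezone("Europe/Zurich")
dfo['from_local'] = dfo['from'].dt.tz_convert(local_tz)
dfo['to_local'] = dfo['to'].dt.tz_convert(local_tz)

dfo['duration'] = dfo['to'] - dfo['from']
dfo['to_date'] = dfo['to_local'].dt.date      # no AttributeError on the empty frame
the_mean = dfo['value'].mean()                # NaN for empty input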

Upvotes: 1
