Reputation: 101

How to keep track of different types of missing values in pandas?

Summary

It's important in many scientific applications to keep track of different kinds of missing value. Is a value for 'weekly income from main job' missing because the person doesn't have a job, or because they have a job but refused to answer?

Storing all missing values as NA or NaN loses this information.
Storing missing value labels (e.g., 'missing because no job', 'missing because refused to answer') in a separate column means the researcher must keep track of two columns for every operation she performs – such as groupby, renaming, and so on. This creates endless opportunities for mistakes and errors.
Storing missing value labels within the same column (e.g., as negative numbers, as in the example below, or very large numbers like 99999) means the researcher must manually keep track of how missing value labels are encoded for every column, and creates many other opportunities for mistakes (e.g., forgetting that a column includes missing values and taking a raw mean instead of using the correct mask).

It is very easy to handle this problem in Stata (see below), by using a data type that stores both numeric values and missing value labels, and with functions that know how to handle this data type. This is highly performant (data type remains numeric, not string or mixed – think of NumPy's data types, except instead of having just NaN we have NaN1, NaN2, etc.) What is the best way of achieving something like this in pandas?

Note: I'm an economist, but this is also an incredibly common workflow for political scientists, epidemiologists, etc. – anyone who deals with survey data. In this context, the analyst knows what the missing values are via a codebook, really cares about keeping track of them, and has hundreds or thousands of columns to deal with – so, indeed, needs an automated way of keeping track of them.

Motivation/context

It's extremely common when dealing with any kind of survey data to have multiple kinds of missing data.

Here is a minimal example from a government questionnaire used to produce official employment statistics:

[Q1] Do you have a job?
[Q2] [If Q1=Yes] What is your weekly income from that job?

The above occurs in pretty much every government-run labor force survey in the world (e.g., the UK Labour Force Survey, the US Current Population Survey, etc.).

Now, for a given respondent, if [Q2] is missing, it could be that (1) they answered No to [Q1], and so were ineligible to be asked [Q2], or that (2) they answered Yes to [Q1] but refused to answer [Q2] (perhaps because they were embarrassed at how much/little they earn, or because they didn't know).

As a researcher, it matters a great deal to me whether it was (1) that occurred, or whether it was (2). Suppose my job is to report the average weekly income of workers in the United States. If there are many missing values for this [Q2] column, but they are all labeled 'missing because respondent answered no to [Q1]', then I can take the average of [Q2] with confidence – it is, indeed, the average weekly income of people in work. (All the missing values are people who didn't have a job.)

On the other hand, if those [Q2] missing values are all labeled 'missing because respondent was asked this question but refused to answer', then I cannot simply report the average of [Q2] as the average weekly income of workers. I'll need to issue caveats around my results. I'll need to analyze the kinds of people who don't answer (are they missing at random, or are people in higher-income occupations more likely to refuse, for example, biasing my results?). Possibly I'll try to impute missing values, and so on.

The problem

Because these 'reasons for being missing' are so important, government statistical agencies will code the different reasons within the column:

So the column containing the answers to [Q2] above might contain the values [1500, -8, 10000, -2, 3000, -1, 6400].

In this case, '1500', '10000', and so on are 'true' answers to [Q2] ($1,500 weekly income, $10,000 weekly income, etc.); whereas '-8' means they weren't eligible to answer (because they answered No to [Q1]) '-2' means they were eligible to answer but refused to do so, and so on.

Now, obviously, if I take the average of this column, I'm going to get something meaningless.

On the other hand, if I just replace all negative values with NaN, then I can take the average – but I've lost all this valuable information about why values are missing. For example, I may want to have a function that takes any column and reports, for that column, statistics like the mean and median, the number of eligible observations (i.e., everything except value=-8), and the percent of those that were non-missing.

It works great in Stata

Doing this in Stata is extremely easy. Stata has 27 numeric missing categories: '.a' to '.z'. (More details here.) I can write:

replace weekly_income = .a if weekly_income == -1 replace weekly_income = .b if weekly_income == -8

and so on.

Then (in pseudocode) I can write

stats weekly_income if weekly_income!=.b

When reporting the mean, Stata will automatically ignore the values coded as missing (indeed, they're now not numeric); but it will also give me missing value statistics only for the observations I care about (in this case, those eligible to be asked the question, i.e., those who weren't originally coded '-8').

What is the best way to handle this in Pandas?

Setup:

>>> import pandas as pd
>>> df = pd.DataFrame.from_dict({
        'income': [1500, -8, 10000, -2, 3000, -1, 6400]})

Desired outcome:

>>> df.income.missing_dict = {'-1': ['.a', 'Don\'t know'], '-2': ['.b', 'Refused']} # etc.
>>> df
  income
0       1500
1  Inapplic.
2      10000
3    Refused
4       3000
5 Don't know
6       6400

>>> assert df.income.mean() == np.mean([1500, 10000, 3000, 6400])
(passes)

The 'obvious' workaround

Clearly, one option is to split every column into two columns: one numeric column with non-missing values and NaNs, and the other a categorical column with categories for the different types of missing value.

But this is extremely inconvenient. These surveys often have thousands of columns, and a researcher might well use hundreds in certain kinds of economic analysis. Having two columns for every 'underlying' column means the researcher has to keep track of two columns for every operation she performs – such as groupby, renaming, and so on. This creates endless opportunities for mistakes and errors. It also means that displaying the table is very wasteful – for any column, I need to now display two columns, one of which for any given observation is always redundant. (This is wasteful both of screen real estate, and of the human analysts' attention, having to identify which two columns are a 'pair'.)

Other ideas

Two other thoughts that occur to me, both probably non-ideal:

(1) Create a new data type in pandas that works similarly to Stata (i.e., adds '.a', '.b', etc. to allowable values for numeric columns).

(2) Use the two-columns solution above, but (re)write 'wrapper' functions in pandas so that 'groupby' etc. keeps track of the pairs of columns for me.

I suspect that (1) is the best solution for the long term, but it would presumably require a huge amount of development.

On the other hand, maybe there are already packages that solve this? Or people have better work-arounds?

Upvotes: 3

Answers (2)

BallpointBen

Reputation: 13974

Pandas recently introduced a custom array type called ExtensionArray that allows defining what is in essence a custom column type, allowing you to (sort of) use actual values alongside missing data without dealing with two columns. Here is a very, very crude implementation, which has barely been tested:

import numpy as np
import pandas as pd
from pandas.core.arrays.base import ExtensionArray


class StataData(ExtensionArray):
    def __init__(
        self, data, missing=None, factors=None, dtype=None, copy=False
    ):
        def own(array, dtype=dtype):
            array = np.asarray(array, dtype)
            if copy:
                array = array.copy()
            return array

        self.data = own(data)

        if missing is None:
            missing = np.zeros_like(data, dtype=int)
        else:
            missing = own(missing, dtype=int)
        self.missing = missing

        self.factors = own(factors)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars, dtype=dtype, copy=copy)

    @classmethod
    def _from_factorized(cls, data, original):
        return cls(original, None, data)

    def __getitem__(self, key):
        return type(self)(
            self.data[key], self.missing[key], self.factors
        )

    def __setitem__(self, key, value):
        self.data[key] = value
        self.missing[key] = 0

    def __len__(self):
        return len(self.data)

    def __iter__(self):
        return iter(self.data)

    @property
    def dtype(self):
        return self.data.dtype

    @property
    def shape(self):
        return self.data.shape

    @property
    def nbytes(self):
        return self.data.nbytes + self.missing.nbytes + self.factors.nbytes

    def view(self):
        return self

    @property
    def reason_missing(self):
        return self.missing

    def isna(self):
        return self.missing != 0

    def __repr__(self):
        s = {}
        for attr in ['data', 'missing', 'factors']:
            s[attr] = getattr(self, attr)
        return repr(s)

With this implementation, you can do the following:

>>> a = StataData([1, 2, 3, 4], [0, 0, 1, 0])
>>> s = pd.Series(a)
>>> print(s[s.isna()])
2    3
dtype: int32
>>> print(s[~s.isna()])
0    1
1    2
3    4
dtype: int32
>>> print(s.isna().values.reason_missing)
array([1])

Hopefully someone who understands this API can chime in and help improve this. For starters, a cannot be used in DataFrames, only Series.

>>> print(pd.DataFrame({'a': s}).isna())
0  False
1  False
2  False
3  False

Upvotes: 0

Prune

Reputation: 77910

To show the solution, I'm taking the liberty of changing the missing_dict keys to match the data type of income.

>>> df
   income
0    1500
1      -8
2   10000
3      -2
4    3000
5      -1
6    6400
>>> df.income.missing_dict
{-8: ['.c', 'Stifled by companion'], -2: ['.b', 'Refused'], -1: ['.a', "Don't know"]}

Now, here's how to filter the rows according to the values being in the "missing" list:

>>> df[(~df.income.isin((df.income.missing_dict)))]
   income
0    1500
2   10000
4    3000
6    6400

Note the extra parentheses around the filter values: we have to pass a tuple of values to isin. Then apply the tilde operator, bit-wise negation, to get a series of Booleans.

Finally, apply mean to the resulting data column:

>>> df[(~df.income.isin((df.income.missing_dict)))].mean()
income    5225.0
dtype: float64

Does that toss you in the right direction? From here, you can simply replace income with the appropriate column or variable name as needed.

Upvotes: 1