Reputation: 101
It's important in many scientific applications to keep track of different kinds of missing value. Is a value for 'weekly income from main job' missing because the person doesn't have a job, or because they have a job but refused to answer?
NA
or NaN
loses this information.It is very easy to handle this problem in Stata
(see below), by using a data type that stores both numeric values and missing value labels, and with functions that know how to handle this data type. This is highly performant (data type remains numeric, not string or mixed – think of NumPy's data types, except instead of having just NaN
we have NaN1
, NaN2
, etc.) What is the best way of achieving something like this in pandas?
Note: I'm an economist, but this is also an incredibly common workflow for political scientists, epidemiologists, etc. – anyone who deals with survey data. In this context, the analyst knows what the missing values are via a codebook, really cares about keeping track of them, and has hundreds or thousands of columns to deal with – so, indeed, needs an automated way of keeping track of them.
It's extremely common when dealing with any kind of survey data to have multiple kinds of missing data.
Here is a minimal example from a government questionnaire used to produce official employment statistics:
The above occurs in pretty much every government-run labor force survey in the world (e.g., the UK Labour Force Survey, the US Current Population Survey, etc.).
Now, for a given respondent, if [Q2] is missing, it could be that (1) they answered No to [Q1], and so were ineligible to be asked [Q2], or that (2) they answered Yes to [Q1] but refused to answer [Q2] (perhaps because they were embarrassed at how much/little they earn, or because they didn't know).
As a researcher, it matters a great deal to me whether it was (1) that occurred, or whether it was (2). Suppose my job is to report the average weekly income of workers in the United States. If there are many missing values for this [Q2] column, but they are all labeled 'missing because respondent answered no to [Q1]', then I can take the average of [Q2] with confidence – it is, indeed, the average weekly income of people in work. (All the missing values are people who didn't have a job.)
On the other hand, if those [Q2] missing values are all labeled 'missing because respondent was asked this question but refused to answer', then I cannot simply report the average of [Q2] as the average weekly income of workers. I'll need to issue caveats around my results. I'll need to analyze the kinds of people who don't answer (are they missing at random, or are people in higher-income occupations more likely to refuse, for example, biasing my results?). Possibly I'll try to impute missing values, and so on.
Because these 'reasons for being missing' are so important, government statistical agencies will code the different reasons within the column:
So the column containing the answers to [Q2] above might contain the values [1500, -8, 10000, -2, 3000, -1, 6400].
In this case, '1500', '10000', and so on are 'true' answers to [Q2] ($1,500 weekly income, $10,000 weekly income, etc.); whereas '-8' means they weren't eligible to answer (because they answered No to [Q1]) '-2' means they were eligible to answer but refused to do so, and so on.
Now, obviously, if I take the average of this column, I'm going to get something meaningless.
On the other hand, if I just replace all negative values with NaN
, then I can take the average – but I've lost all this valuable information about why values are missing. For example, I may want to have a function that takes any column and reports, for that column, statistics like the mean and median, the number of eligible observations (i.e., everything except value=-8), and the percent of those that were non-missing.
Doing this in Stata
is extremely easy. Stata
has 27 numeric missing categories: '.a' to '.z'. (More details here.) I can write:
replace weekly_income = .a if weekly_income == -1
replace weekly_income = .b if weekly_income == -8
and so on.
Then (in pseudocode) I can write
stats weekly_income if weekly_income!=.b
When reporting the mean, Stata
will automatically ignore the values coded as missing (indeed, they're now not numeric); but it will also give me missing value statistics only for the observations I care about (in this case, those eligible to be asked the question, i.e., those who weren't originally coded '-8').
Setup:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict({
'income': [1500, -8, 10000, -2, 3000, -1, 6400]})
Desired outcome:
>>> df.income.missing_dict = {'-1': ['.a', 'Don\'t know'], '-2': ['.b', 'Refused']} # etc.
>>> df
income
0 1500
1 Inapplic.
2 10000
3 Refused
4 3000
5 Don't know
6 6400
>>> assert df.income.mean() == np.mean([1500, 10000, 3000, 6400])
(passes)
Clearly, one option is to split every column into two columns: one numeric column with non-missing values and NaNs, and the other a categorical column with categories for the different types of missing value.
But this is extremely inconvenient. These surveys often have thousands of columns, and a researcher might well use hundreds in certain kinds of economic analysis. Having two columns for every 'underlying' column means the researcher has to keep track of two columns for every operation she performs – such as groupby, renaming, and so on. This creates endless opportunities for mistakes and errors. It also means that displaying the table is very wasteful – for any column, I need to now display two columns, one of which for any given observation is always redundant. (This is wasteful both of screen real estate, and of the human analysts' attention, having to identify which two columns are a 'pair'.)
Two other thoughts that occur to me, both probably non-ideal:
(1) Create a new data type in pandas that works similarly to Stata (i.e., adds '.a', '.b', etc. to allowable values for numeric columns).
(2) Use the two-columns solution above, but (re)write 'wrapper' functions in pandas so that 'groupby' etc. keeps track of the pairs of columns for me.
I suspect that (1) is the best solution for the long term, but it would presumably require a huge amount of development.
On the other hand, maybe there are already packages that solve this? Or people have better work-arounds?
Upvotes: 3
Views: 470
Reputation: 13750
Pandas recently introduced a custom array type called ExtensionArray
that allows defining what is in essence a custom column type, allowing you to (sort of) use actual values alongside missing data without dealing with two columns. Here is a very, very crude implementation, which has barely been tested:
import numpy as np
import pandas as pd
from pandas.core.arrays.base import ExtensionArray
class StataData(ExtensionArray):
def __init__(
self, data, missing=None, factors=None, dtype=None, copy=False
):
def own(array, dtype=dtype):
array = np.asarray(array, dtype)
if copy:
array = array.copy()
return array
self.data = own(data)
if missing is None:
missing = np.zeros_like(data, dtype=int)
else:
missing = own(missing, dtype=int)
self.missing = missing
self.factors = own(factors)
@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
return cls(scalars, dtype=dtype, copy=copy)
@classmethod
def _from_factorized(cls, data, original):
return cls(original, None, data)
def __getitem__(self, key):
return type(self)(
self.data[key], self.missing[key], self.factors
)
def __setitem__(self, key, value):
self.data[key] = value
self.missing[key] = 0
def __len__(self):
return len(self.data)
def __iter__(self):
return iter(self.data)
@property
def dtype(self):
return self.data.dtype
@property
def shape(self):
return self.data.shape
@property
def nbytes(self):
return self.data.nbytes + self.missing.nbytes + self.factors.nbytes
def view(self):
return self
@property
def reason_missing(self):
return self.missing
def isna(self):
return self.missing != 0
def __repr__(self):
s = {}
for attr in ['data', 'missing', 'factors']:
s[attr] = getattr(self, attr)
return repr(s)
With this implementation, you can do the following:
>>> a = StataData([1, 2, 3, 4], [0, 0, 1, 0])
>>> s = pd.Series(a)
>>> print(s[s.isna()])
2 3
dtype: int32
>>> print(s[~s.isna()])
0 1
1 2
3 4
dtype: int32
>>> print(s.isna().values.reason_missing)
array([1])
Hopefully someone who understands this API can chime in and help improve this. For starters, a
cannot be used in DataFrames
, only Series
.
>>> print(pd.DataFrame({'a': s}).isna())
0 False
1 False
2 False
3 False
Upvotes: 0
Reputation: 77847
To show the solution, I'm taking the liberty of changing the missing_dict
keys to match the data type of income
.
>>> df
income
0 1500
1 -8
2 10000
3 -2
4 3000
5 -1
6 6400
>>> df.income.missing_dict
{-8: ['.c', 'Stifled by companion'], -2: ['.b', 'Refused'], -1: ['.a', "Don't know"]}
Now, here's how to filter the rows according to the values being in the "missing" list:
>>> df[(~df.income.isin((df.income.missing_dict)))]
income
0 1500
2 10000
4 3000
6 6400
Note the extra parentheses around the filter values: we have to pass a tuple
of values to isin
. Then apply the tilde operator, bit-wise negation, to get a series of Booleans.
Finally, apply mean
to the resulting data column:
>>> df[(~df.income.isin((df.income.missing_dict)))].mean()
income 5225.0
dtype: float64
Does that toss you in the right direction? From here, you can simply replace income
with the appropriate column or variable name as needed.
Upvotes: 1