Best data structure for sparse data with multiple dimensions

Question

I want to structure my data (similarly to pandas) to allow easy data exploration. I tried xarray.DataArray, which the pandas documentation recommends for n-dimensional data (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel4d-and-panelnd-deprecated), but it appears inefficient because my data is sparse. Is there a better way to structure my data, either within xarray.DataArray or with another Python data structure, that still allows easy data exploration?

Description of data

My data consists of prescriptions given to patients. Each entry consists of:

Date (datetime64)
Patient Id (int)
Drug name (string)
Drug type (string)
Drug class (string)
Scheduled dosage (real value)
Dosage as needed (real value)
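
To make the entry layout concrete, here is a minimal sketch of a few such entries as a long-format pandas DataFrame, one row per prescription. The column names and all values are made up for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical column names and made-up values, for illustration only.
records = pd.DataFrame(
    {
        "date": pd.to_datetime(["2015-01-05", "2015-01-05", "2015-02-10"]),
        "patient_id": [1, 1, 2],
        "drug_name": ["drug_a", "drug_b", "drug_a"],
        "drug_type": ["type_x", "type_y", "type_x"],
        "drug_class": ["class_1", "class_2", "class_1"],
        "scheduled_dosage": [10.0, 2.5, 5.0],
        # NaN marks a prescription with no as-needed dosage.
        "as_needed_dosage": [0.0, 1.0, float("nan")],
    }
)

# Show the dtype pandas inferred for each field.
print(records.dtypes)
```

In this layout only the prescriptions that actually exist take up rows.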

Several patients may receive prescriptions on the same date. A patient may also be prescribed several drugs (e.g., 2-3 drugs) at the same time, each with a 'mandatory' (scheduled) dosage and an 'optional' (as needed) dosage. My dataset currently consists of 397 different patients, 1520 different dates and 161 different drugs. I only have 21790 non-null entries out of the 397*1520*161*2 = 194,307,680 possible entries (i.e., about 0.01%).
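
The sparsity figure can be checked with a few lines of arithmetic. The sketch below also estimates the size of the dense array, assuming the dosages are stored as 8-byte float64 values (an assumption; the dtype is not stated above).

```python
# Counts taken from the description of the dataset.
n_patients, n_dates, n_drugs, n_timings = 397, 1520, 161, 2
n_non_null = 21790

n_cells = n_patients * n_dates * n_drugs * n_timings
density = n_non_null / n_cells

# Assumption: one float64 (8 bytes) per cell in the dense array.
dense_bytes = n_cells * 8

print(f"cells:      {n_cells:,}")                  # 194,307,680
print(f"density:    {density:.4%}")                # 0.0112%
print(f"dense size: {dense_bytes / 1e9:.2f} GB")   # 1.55 GB
```

Under that assumption the dense array takes about 1.55 GB, almost all of it NaN.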

Initial code

My data is currently organized as the following xarray.DataArray:

import xarray

# drug_type and drug_class are extra (non-index) labels attached to the 'drug' axis.
drugs = xarray.DataArray(dosages, coords={'patient': patients, 'time': dates,
                                          'drug': drug_names, 'timing': timings,
                                          'drug_type': ('drug', drug_types),
                                          'drug_class': ('drug', drug_classes)},
                         dims=['patient', 'time', 'drug', 'timing'])

where dosages.shape = (len(patients), len(dates), len(drug_names), 2). The timing axis corresponds to 'scheduled' vs. 'as needed' dosage. All the missing/zero entries are set to numpy.nan.
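
For anyone who wants to reproduce the layout, here is a small self-contained version of the same construction. The inputs are toy stand-ins with made-up values, and it assumes numpy, pandas and xarray are installed.

```python
import numpy as np
import pandas as pd
import xarray

# Toy stand-ins for the real inputs (all values are made up).
patients = [1, 2]
dates = pd.to_datetime(["2015-01-05", "2015-02-10", "2015-03-01"])
drug_names = ["drug_a", "drug_b"]
drug_types = ["type_x", "type_y"]
drug_classes = ["class_1", "class_2"]
timings = ["scheduled", "as needed"]

# Dense array: NaN everywhere except the few recorded dosages.
shape = (len(patients), len(dates), len(drug_names), len(timings))
dosages = np.full(shape, np.nan)
dosages[0, 0, 0, 0] = 10.0  # patient 1, first date, drug_a, scheduled
dosages[1, 1, 0, 1] = 5.0   # patient 2, second date, drug_a, as needed

drugs = xarray.DataArray(dosages, coords={'patient': patients, 'time': dates,
                                          'drug': drug_names, 'timing': timings,
                                          'drug_type': ('drug', drug_types),
                                          'drug_class': ('drug', drug_classes)},
                         dims=['patient', 'time', 'drug', 'timing'])

# Count the non-null cells and compare with the total number of cells.
n_filled = int(drugs.notnull().sum())
print(drugs.shape, n_filled, n_filled / drugs.size)
```

Even in this toy case most cells are NaN, which is the same pattern as in the full array at a much smaller scale.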
