Reputation: 4144
I am digesting several csv files (each with one or more year of data) to categorize medical treatments into broad categories, while also keeping only a subset of the original information, and even aggregate up to a monthly number (by AR=year and month) of treatments per person (LopNr). Many treatments belong to different categories at the same time (multiple diagnosis codes are listed in the relevant column in the csv, thus I separate that field into a column of lists and categorize rows by any diagnosis codes belonging to a relevant range of ICD-9 codes).
I am using IOPro to save on memory, but I am still running into a segfault (still investigating). The text files are several GBs each, but this machine has 256 GB RAM. Either one of the packages is buggy, or I need a more memory efficient solution.
I am using versions pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0 under Linux.
So the original data would look something like this:
LopNr AR INDATUMA DIAGNOS …
1 2007 20070812 C32 F17
1 2007 20070816 C36
And I hope to see aggregates like this:
LopNr AR month tobacco …
1 2007 8 2
By the way, I would need Stata dta files in the end, but I go through cvs because pandas.DataFrame.to_stata seemed flaky in my experience, but maybe I am missing something there too.
# -*- coding: utf-8 -*-
import iopro
import numpy as np
from pandas import *
all_treatments = DataFrame()
filelist = ['oppenvard20012005','oppenvard20062010','oppenvard2011','oppenvard2012','slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']
tobacco = lambda lst: any( (((x >= 'C30') and (x<'C40')) or ((x >= 'F17') and (x<'F18'))) for x in lst)
nutrition = lambda lst: any( (((x >= 'D50') and (x<'D54')) or ((x >= 'E10') and (x<'E15')) or ((x >= 'E40') and (x<'E47')) or ((x >= 'E50') and (x<'E69'))) for x in lst)
mental = lambda lst: any( (((x >= 'F') and (x<'G')) ) for x in lst)
alcohol = lambda lst: any( (((x >= 'F10') and (x<'F11')) or ((x >= 'K70') and (x<'K71'))) for x in lst)
circulatory = lambda lst: any( (((x >= 'I') and (x<'J')) ) for x in lst)
dental = lambda lst: any( (((x >= 'K02') and (x<'K04')) ) for x in lst)
accident = lambda lst: any( (((x >= 'V01') and (x<'X60')) ) for x in lst)
selfharm = lambda lst: any( (((x >= 'X60') and (x<'X85')) ) for x in lst)
cancer = lambda lst: any( (((x >= 'C') and (x<'D')) ) for x in lst)
endonutrimetab = lambda lst: any( (((x >= 'E') and (x<'F')) ) for x in lst)
pregnancy = lambda lst: any( (((x >= 'O') and (x<'P')) ) for x in lst)
other_stress = lambda lst: any( (((x >= 'J00') and (x<'J48')) or ((x >= 'L20') and (x<'L66')) or ((x >= 'K20') and (x<'K60')) or ((x >= 'R') and (x<'S')) or ((x >= 'X86') and (x<'Z77'))) for x in lst)
for file in filelist:
filename = 'PATH' + file +'.txt'
adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
treatments = adapter[['LopNr','AR','DIAGNOS','INDATUMA']][:]
treatments['month'] = treatments['INDATUMA'] % 10000
treatments['day'] = treatments['INDATUMA'] % 100
treatments['month'] = (treatments['month']-treatments['day'])/100
del treatments['day']
diagnoses = treatments['DIAGNOS'].str.split(' ')
del treatments['DIAGNOS']
treatments['tobacco'] = diagnoses.map(tobacco)
treatments['nutrition'] = diagnoses.map(nutrition)
treatments['mental'] = diagnoses.map(mental)
treatments['alcohol'] = diagnoses.map(alcohol)
treatments['circulatory'] = diagnoses.map(circulatory)
treatments['dental'] = diagnoses.map(dental)
treatments['accident'] = diagnoses.map(accident)
treatments['selfharm'] = diagnoses.map(selfharm)
treatments['cancer'] = diagnoses.map(cancer)
treatments['endonutrimetab'] = diagnoses.map(endonutrimetab)
treatments['pregnancy'] = diagnoses.map(pregnancy)
treatments['other_stress'] = diagnoses.map(other_stress)
all_treatments = all_treatments.append(treatments)
all_treatments = all_treatments.groupby(['LopNr','AR','month']).aggregate(np.count_nonzero) #.sum()
all_treatments = all_treatments.astype(int,copy=False,raise_on_error=False)
all_treatments.to_csv('PATH.csv')
Upvotes: 0
Views: 121
Reputation: 109626
A few comments:
def
.Example:
def tobacco(codes):
return any( 'C30' <= x < 'C40' or
'F17' <= x < 'F18' for x in codes)
You can also vectorize these functions as follows:
def tobacco(codes_column):
return [any('C30' <= code < 'C40' or
'F17' <= code < 'F18'
for code in codes) if codes else False
for codes in codes_column]
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments['tobacco'] = tobacco(diagnoses)
You initialize all_treatments
to a DataFrame, and then append to it. This is very inefficient. Try all_treatments = list()
, and add all_treatments = pd.concat(all_treatments, ignore_index=True)
outside the loop just before your groupby
. In addition, it should be all_treatments.append(treatments)
(vs. all_treatments = all_treatments.append(treatments)
)
To calculate the month for the purpose of grouping, you can use:
all_treatments['month'] = all_treatments.INDATUMA % 10000 // 100
Lastly, instead of applying your lambda functions to each file once its read, try applying them to the all_treatments
DataFrame instead.
p.s. You may also want to try .sum()
on your groupby
statement instead of .aggregate(np.count_nonzero)
Upvotes: 1
Reputation: 8906
I think you need to find a way to vectorize your solution. Using map
and lambda functions is not efficient and doesn't make use of the speed-ups that make pandas so useful. It's hard to say for sure because you haven't posted sample data, but I think a good starting point would be to do
diagnoses = treatments['DIAGNOS'].str.split(expand=True)
The result will be a data frame with a column for each word (or element in the result of split). You can then do vectorized comparisons against the whole DataFrame. It might looks something like this:
between_c_vals = (diagnoses >= 'C30') & (diagnoses <= 'C40')
between_f_vals = (diagnoses >= 'F17') & (diagnoses <= 'F18')
treatment['tobacco'] = (between_c_vals | between_f_vals).any(axis=1)
This should be hundreds of times faster than using .map
which uses loops in Python. Note that the bit operators &
and |
can be used to perform set logic on Boolean vectors and matrices (or DataFrames).
If you showed an example of treatment['DIAGNOS']
I could maybe help more. One thing to be careful of is NaN
values when doing comparisons as comparing NaN
to anything always return False
but I think it should be fine here is it won't return any unwanted True values.
Upvotes: 1