László

Reputation: 4144

improve upon mapped lambdas in Python (pandas)

I am digesting several csv files (each with one or more years of data) to categorize medical treatments into broad categories, keeping only a subset of the original information and aggregating up to a monthly number of treatments per person (LopNr) by year (AR) and month. Many treatments belong to several categories at the same time: multiple diagnosis codes are listed in the relevant column of the csv, so I split that field into a column of lists and categorize a row by whether any of its diagnosis codes falls into the relevant range of ICD codes.

I am using IOPro to save on memory, but I am still running into a segfault (still investigating). The text files are several GB each, but this machine has 256 GB of RAM. Either one of the packages is buggy, or I need a more memory-efficient solution.

I am using pandas 0.16.2 (np19py26_0), iopro 1.7.1 (np19py27_p0), and Python 2.7.10 (build 0) under Linux.

So the original data would look something like this:

LopNr   AR INDATUMA DIAGNOS …
1     2007 20070812 C32 F17
1     2007 20070816 C36

And I hope to see aggregates like this:

LopNr   AR month tobacco …
1     2007     8       2

By the way, I need Stata .dta files in the end, but I go through csv because pandas.DataFrame.to_stata seemed flaky in my experience; maybe I am missing something there too.
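
For reference, the direct export I mean would be something like this (just a sketch with a placeholder output path):

# sketch: reset_index() turns the LopNr/AR/month group keys back into
# ordinary columns, and write_index=False then skips writing the row index
all_treatments.reset_index().to_stata('PATH.dta', write_index=False)

The full script follows: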

# -*- coding: utf-8 -*-
import iopro
import numpy as np
from pandas import *

all_treatments  = DataFrame()
filelist = ['oppenvard20012005','oppenvard20062010','oppenvard2011','oppenvard2012','slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']

# Category tests: each lambda flags whether any code in a row's list of diagnosis codes falls in the relevant range
tobacco = lambda lst: any( (((x >= 'C30') and (x<'C40')) or ((x >= 'F17') and (x<'F18')))  for x in lst)
nutrition = lambda lst: any( (((x >= 'D50') and (x<'D54')) or ((x >= 'E10') and (x<'E15')) or ((x >= 'E40') and (x<'E47')) or ((x >= 'E50') and (x<'E69')))  for x in lst)
mental = lambda lst: any( (((x >= 'F') and (x<'G')) )  for x in lst)
alcohol = lambda lst: any( (((x >= 'F10') and (x<'F11')) or ((x >= 'K70') and (x<'K71')))  for x in lst)
circulatory = lambda lst: any( (((x >= 'I') and (x<'J')) )  for x in lst)
dental = lambda lst: any( (((x >= 'K02') and (x<'K04')) )  for x in lst)
accident = lambda lst: any( (((x >= 'V01') and (x<'X60')) )  for x in lst)
selfharm = lambda lst: any( (((x >= 'X60') and (x<'X85')) )  for x in lst)
cancer = lambda lst: any( (((x >= 'C') and (x<'D')) )  for x in lst)
endonutrimetab = lambda lst: any( (((x >= 'E') and (x<'F')) )  for x in lst)
pregnancy = lambda lst: any( (((x >= 'O') and (x<'P')) )  for x in lst)
other_stress = lambda lst: any( (((x >= 'J00') and (x<'J48')) or ((x >= 'L20') and (x<'L66')) or ((x >= 'K20') and (x<'K60')) or ((x >= 'R') and (x<'S')) or ((x >= 'X86') and (x<'Z77')))  for x in lst)

for file in filelist:
    filename = 'PATH' + file +'.txt'
    adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
    # keep only the columns needed for the aggregation
    treatments = adapter[['LopNr','AR','DIAGNOS','INDATUMA']][:]
    # INDATUMA is an integer date of the form YYYYMMDD: % 10000 keeps MMDD,
    # and stripping the day before dividing by 100 leaves the month
    treatments['month'] = treatments['INDATUMA'] % 10000
    treatments['day'] = treatments['INDATUMA'] % 100
    treatments['month'] = (treatments['month']-treatments['day'])/100
    del treatments['day']
    del treatments['day']
    # split the space-separated diagnosis codes into a list per row
    diagnoses = treatments['DIAGNOS'].str.split(' ')
    del treatments['DIAGNOS']
    treatments['tobacco'] = diagnoses.map(tobacco)
    treatments['nutrition'] = diagnoses.map(nutrition)
    treatments['mental'] = diagnoses.map(mental)
    treatments['alcohol'] = diagnoses.map(alcohol)
    treatments['circulatory'] = diagnoses.map(circulatory)
    treatments['dental'] = diagnoses.map(dental)
    treatments['accident'] = diagnoses.map(accident)
    treatments['selfharm'] = diagnoses.map(selfharm)
    treatments['cancer'] = diagnoses.map(cancer)
    treatments['endonutrimetab'] = diagnoses.map(endonutrimetab)
    treatments['pregnancy'] = diagnoses.map(pregnancy)
    treatments['other_stress'] = diagnoses.map(other_stress)
    all_treatments = all_treatments.append(treatments)
all_treatments = all_treatments.groupby(['LopNr','AR','month']).aggregate(np.count_nonzero) #.sum()
all_treatments = all_treatments.astype(int,copy=False,raise_on_error=False)
all_treatments.to_csv('PATH.csv')

Upvotes: 0

Views: 121

Answers (2)

Alexander

Reputation: 109626

A few comments:

  1. As noted above, you should simplify your lambda expressions for readability, possibly using def.

Example:

def tobacco(codes):
    return any( 'C30' <= x < 'C40' or
                'F17' <= x < 'F18'  for x in codes)

You can also vectorize these functions as follows:

def tobacco(codes_column):
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if codes else False
            for codes in codes_column]

diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments['tobacco'] = tobacco(diagnoses)

  2. You initialize all_treatments to a DataFrame and then append to it inside the loop, which is very inefficient. Try all_treatments = list(), and add all_treatments = pd.concat(all_treatments, ignore_index=True) outside the loop, just before your groupby (see the sketch at the end of this answer). In addition, with a list it should be all_treatments.append(treatments) (vs. all_treatments = all_treatments.append(treatments)).

  3. To calculate the month for the purpose of grouping, you can use:

    all_treatments['month'] = all_treatments.INDATUMA % 10000 // 100

  4. Lastly, instead of applying your lambda functions to each file once it's read, try applying them to the combined all_treatments DataFrame instead.

P.S. You may also want to try .sum() on your groupby instead of .aggregate(np.count_nonzero); summing the boolean category columns gives the same per-group counts.
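
Putting points 2-4 together, the loop might be restructured roughly like this (a sketch only: it reuses filelist and the iopro adapter call from the question plus the vectorized tobacco from above, and the other category columns would follow the same pattern):

import pandas as pd

frames = []
for file in filelist:
    adapter = iopro.text_adapter('PATH' + file + '.txt', parser='csv',
                                 field_names=True, output='dataframe',
                                 delimiter='\t')
    # keep only the needed columns; collect the per-file frames in a list
    frames.append(adapter[['LopNr', 'AR', 'DIAGNOS', 'INDATUMA']][:])

all_treatments = pd.concat(frames, ignore_index=True)
all_treatments['month'] = all_treatments.INDATUMA % 10000 // 100

# apply the vectorized category functions once, on the combined frame
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments['tobacco'] = tobacco(diagnoses)
# ... same pattern for the other categories ...

monthly = (all_treatments
           .drop(['DIAGNOS', 'INDATUMA'], axis=1)
           .groupby(['LopNr', 'AR', 'month'])
           .sum())  # the boolean columns sum to per-month counts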

Upvotes: 1

JoeCondron

Reputation: 8906

I think you need to find a way to vectorize your solution. Using map with lambda functions is not efficient and doesn't make use of the speed-ups that make pandas so useful. It's hard to say for sure without a larger sample of the data, but I think a good starting point would be to do

diagnoses = treatments['DIAGNOS'].str.split(expand=True)

The result will be a DataFrame with one column per code (one for each element produced by the split). You can then do vectorized comparisons against the whole DataFrame. It might look something like this:

between_c_vals = (diagnoses >= 'C30') & (diagnoses < 'C40')
between_f_vals = (diagnoses >= 'F17') & (diagnoses < 'F18')
treatments['tobacco'] = (between_c_vals | between_f_vals).any(axis=1)

This should be hundreds of times faster than using .map, which loops in Python. Note that the bitwise operators & and | perform elementwise logic on Boolean vectors and matrices (or DataFrames). If you showed an example of treatments['DIAGNOS'] I could maybe help more. One thing to be careful of is NaN values in comparisons: comparing NaN to anything always returns False, but that should be fine here, since it won't produce any unwanted True values.
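
To extend that to every category, one option is a small mapping from category name to code ranges and a loop over it, along the lines of the sketch below (the ranges dict is just an illustration, with boundaries copied from a few of the question's lambdas; the remaining categories follow the same pattern):

import pandas as pd

# illustrative mapping: category name -> half-open code ranges [low, high),
# boundaries copied from the question's lambdas
ranges = {
    'tobacco': [('C30', 'C40'), ('F17', 'F18')],
    'alcohol': [('F10', 'F11'), ('K70', 'K71')],
    'cancer':  [('C', 'D')],
    # ... remaining categories ...
}

# fillna('') sidesteps the NaN-comparison caveat: '' never falls in any range
diagnoses = treatments['DIAGNOS'].str.split(' ', expand=True).fillna('')

for name, bounds in ranges.items():
    hit = pd.DataFrame(False, index=diagnoses.index, columns=diagnoses.columns)
    for lo, hi in bounds:
        hit |= (diagnoses >= lo) & (diagnoses < hi)
    treatments[name] = hit.any(axis=1)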

Upvotes: 1
