Reputation: 2814
I want to read and process a large CSV file (data_file) with the following two-column structure:
id params
1 '14':'blah blah','25':'more cool stuff'
2 '157':'yes, more stuff','15':'and even more'
3 '14':'blah blah','25':'more cool stuff'
4 '15':'different here'
5 '157':'yes, more stuff','15':'and even more'
6 '100':'exhausted'
This file contains 30,000,000 lines (5 GB on disk). (The actual strings are UTF-8 encoded; for simplicity, I show them as ASCII here.) Note that some of the values in the second column are repeated.
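For reference, a small script that generates a file with the same layout (a hypothetical example with a made-up file name and value pool; the real data is of course different):

# Hypothetical generator for a small test file in the same layout:
# tab-separated, a header row, and a params column of 'key':'value'
# pairs in which some values repeat.
import random

random.seed(0)
values = ["blah blah", "more cool stuff", "yes, more stuff",
          "and even more", "different here", "exhausted"]

with open("data_file_sample.tsv", "w", encoding="utf-8") as f:
    f.write("id\tparams\n")
    for i in range(1, 1001):  # 1,000 rows instead of 30,000,000
        pairs = ["'%d':'%s'" % (random.randint(10, 200), random.choice(values))
                 for _ in range(random.randint(1, 3))]
        f.write("%d\t%s\n" % (i, ",".join(pairs)))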
I read this using pandas.read_csv():
df = pandas.read_csv(open(data_file, 'rb'), delimiter='\t',
                     usecols=['id', 'params'],
                     dtype={'id': 'u4', 'params': 'str'})
Once the file is read, the DataFrame df uses 1.2 GB of RAM.
So far so good.
Now comes the processing part. I want the params string column to end up in this format:
blah blah||more cool stuff
yes, more stuff||and even more
blah blah||more cool stuff
different here
yes, more stuff||and even more
exhausted
I wrote:
def clean_keywords(x):
    return "||".join(x.split("'")[1:][::2])

df['params'] = df['params'].map(clean_keywords)
This code works in the sense that it gives the correct result. But:

- RAM usage increases considerably during the map operation.
- The extra memory is not released from df afterwards (even after gc.collect()), although the string computed in the params column is shorter than the one that was read.

Can someone explain this and propose an alternative way of performing the above operation using pandas? (I use Python 3.4, pandas 0.16.2, win64.)
Upvotes: 2
Views: 707
Reputation: 2814
Answering my own question.
It turns out that pandas.read_csv() is clever. When the file is read, duplicate strings are stored only once. But when these strings are processed and written back into the column, they are no longer unique, so RAM usage increases. To avoid this, one has to maintain the uniqueness manually. I did it this way:
unique_strings = {}

def clean_keywords(x):
    s = "||".join(x.split("'")[1:][::2])
    return unique_strings.setdefault(s, s)

df['params'] = df['params'].map(clean_keywords)
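To see why the setdefault trick helps, here is a small self-contained check (the example strings are made up): two equal strings computed at run time are separate objects, while routing them through the dict makes them share one object.

# Two equal strings built at run time are normally distinct objects,
# so each copy costs its own memory:
a = "more cool stuff".upper().lower()
b = "more cool stuff".upper().lower()
print(a == b, a is b)   # True False

# setdefault returns the object stored first for equal keys,
# so equal strings end up sharing a single object:
pool = {}
a = pool.setdefault(a, a)
b = pool.setdefault(b, b)
print(a == b, a is b)   # True True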
With this solution, peak RAM usage was only 2.8 GB, and it dropped slightly below the initial usage after reading the data (1.2 GB), as expected.
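As a side note, here is a sketch of a related alternative (not what I measured above, and assuming the category dtype available since pandas 0.15 is acceptable): storing the cleaned column as a Categorical also keeps each distinct string only once, although it does not reduce the peak memory used while map builds the intermediate strings.

# Hypothetical category-dtype variant, using the plain clean_keywords
# (without the unique_strings dict). The Categorical stores every distinct
# string once; the column itself holds only small integer codes.
cleaned = df['params'].map(clean_keywords)   # still creates duplicate strings
df['params'] = cleaned.astype('category')    # deduplicates the stored column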
Upvotes: 2