Reputation: 2814
I want to read and process a large CSV file (data_file) with the following two-column structure:
id params
1 '14':'blah blah','25':'more cool stuff'
2 '157':'yes, more stuff','15':'and even more'
3 '14':'blah blah','25':'more cool stuff'
4 '15':'different here'
5 '157':'yes, more stuff','15':'and even more'
6 '100':'exhausted'
This file contains 30,000,000 lines (5 GB on disk). (The actual strings are UTF-8 encoded; for simplicity, I show them as ASCII here.) Note that some of the values in the second column are repeated.
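For reference, a small script that generates a file with the same layout (a hypothetical example with a made-up file name and value pool; the real data is of course different):

# Hypothetical generator for a small test file in the same layout:
# tab-separated, a header row, and a params column of 'key':'value'
# pairs in which some values repeat.
import random

random.seed(0)
values = ["blah blah", "more cool stuff", "yes, more stuff",
          "and even more", "different here", "exhausted"]

with open("data_file_sample.tsv", "w", encoding="utf-8") as f:
    f.write("id\tparams\n")
    for i in range(1, 1001):  # 1,000 rows instead of 30,000,000
        pairs = ["'%d':'%s'" % (random.randint(10, 200), random.choice(values))
                 for _ in range(random.randint(1, 3))]
        f.write("%d\t%s\n" % (i, ",".join(pairs)))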
I read this using pandas.read_csv():
df = pandas.read_csv(open(data_file, 'rb'), delimiter='\t',
                     usecols=['id', 'params'],
                     dtype={'id': 'u4', 'params': 'str'})
Once the file is read, the DataFrame df uses 1.2 GB of RAM.
So far so good.
Now comes the processing part. I want the params string column to end up in this format:
blah blah||more cool stuff
yes, more stuff||and even more
blah blah||more cool stuff
different here
yes, more stuff||and even more
exhausted
I wrote:
def clean_keywords(x):
    return "||".join(x.split("'")[1:][::2])

df['params'] = df['params'].map(clean_keywords)
This code works in the sense that it gives the correct result. But:

- RAM usage increases considerably during the map operation.
- The extra memory is not released from df afterwards (even after gc.collect()), although the string computed in the params column is shorter than the one that was read.

Can someone explain this and propose an alternative way of performing the above operation using pandas? (I use Python 3.4, pandas 0.16.2, win64.)
Upvotes: 2
Views: 707
Reputation: 2814
Answering my own question.
It turns out that pandas.read_csv() is clever. When the file is read, duplicate strings are stored only once. But when these strings are processed and written back into the column, they are no longer unique, so RAM usage increases. To avoid this, one has to maintain the uniqueness manually. I did it this way:
unique_strings = {}

def clean_keywords(x):
    s = "||".join(x.split("'")[1:][::2])
    return unique_strings.setdefault(s, s)

df['params'] = df['params'].map(clean_keywords)
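To see why the setdefault trick helps, here is a small self-contained check (the example strings are made up): two equal strings computed at run time are separate objects, while routing them through the dict makes them share one object.

# Two equal strings built at run time are normally distinct objects,
# so each copy costs its own memory:
a = "more cool stuff".upper().lower()
b = "more cool stuff".upper().lower()
print(a == b, a is b)   # True False

# setdefault returns the object stored first for equal keys,
# so equal strings end up sharing a single object:
pool = {}
a = pool.setdefault(a, a)
b = pool.setdefault(b, b)
print(a == b, a is b)   # True True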
With this solution, peak RAM usage was only 2.8 GB, and it dropped slightly below the initial usage after reading the data (1.2 GB), as expected.
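As a side note, here is a sketch of a related alternative (not what I measured above, and assuming the category dtype available since pandas 0.15 is acceptable): storing the cleaned column as a Categorical also keeps each distinct string only once, although it does not reduce the peak memory used while map builds the intermediate strings.

# Hypothetical category-dtype variant, using the plain clean_keywords
# (without the unique_strings dict). The Categorical stores every distinct
# string once; the column itself holds only small integer codes.
cleaned = df['params'].map(clean_keywords)   # still creates duplicate strings
df['params'] = cleaned.astype('category')    # deduplicates the stored column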
Upvotes: 2