Wouter De Coster

Reputation: 528

pandas.read_csv slow when reading file with variable length string

I have an issue which I think I've somewhat solved, but I would like to learn more about it and hear about better solutions.

The problem: I have tab-separated files with ~600k lines (plus one comment line), and one of the 8 fields contains a string of variable length, anywhere between 1 and ~2000 characters.

Reading that file with the following function is terribly slow:

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)

However, I don't really care about most of that string (the field is named 'motif'), and I'm okay with truncating it if it's too long, using:

def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)

This is suddenly a lot faster.

So my questions are:

  1. Why is reading this file so slow? Does it have to do with allocating memory?
  2. Why does the converter function help here? It has to call an additional function for every row, yet it is still a lot faster...
  3. What else can be done?

Upvotes: 4

Views: 1275

Answers (1)

Rafaó

Reputation: 599

  1. You didn't mention what "slow" means to you, but if:
    • your file contains ca. 600k rows,
    • each row contains 1 to ~2000 characters (say 1,000 on average, so roughly 1,000 B per line),

then the file's size is about 600,000 * 1,000 B ≈ 600 MB (≈570 MiB). That's a lot to parse and hold in memory, especially if you don't have much RAM.
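
As a rough sanity check (a minimal sketch, not your exact setup: the path and column names below are placeholders, so substitute your own tgfile and list_of_names), you can measure how much of the parsed DataFrame's memory actually goes to the 'motif' strings:

import pandas as pd

tgfile = "data.tsv"                                  # placeholder path
list_of_names = ["c1", "c2", "c3", "c4",
                 "c5", "c6", "c7", "motif"]          # placeholder column names

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)

# deep=True also counts the Python string objects; the variable-length
# 'motif' column will usually dominate here.
print(df.memory_usage(deep=True))
print(df['motif'].str.len().describe())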

  2. It helps because the average 'motif' value kept in memory is no longer ~1,000 B but only a handful of bytes (the truncated string is at most 9 characters: 8 plus the '~' marker). pandas still reads each string while parsing, but it only keeps the truncated copy, so far less memory has to be allocated and held for that column.
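
If you want to verify that on your own data, a quick timing comparison could look like this (again just a sketch with placeholder names):

import time
import pandas as pd

tgfile = "data.tsv"                                  # placeholder path
list_of_names = ["c1", "c2", "c3", "c4",
                 "c5", "c6", "c7", "motif"]          # placeholder column names

def truncate_motif(motif):
    # Same truncation as in the question: keep at most 8 characters plus '~'.
    return motif[:8] + '~' if len(motif) > 8 else motif

def timed_read(**kwargs):
    start = time.perf_counter()
    df = pd.read_csv(tgfile, sep="\t", comment='#',
                     header=None, names=list_of_names, **kwargs)
    return df, time.perf_counter() - start

_, t_full = timed_read()
_, t_truncated = timed_read(converters={'motif': truncate_motif})
print(f"full strings: {t_full:.2f}s, truncated: {t_truncated:.2f}s")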

  3. In cases like this, when you have to load a lot of data, it's good to read the file in chunks:

for chunk in pd.read_csv(tgfile, chunksize=10000):
    process(chunk)

The chunksize parameter specifies how many rows each chunk contains. It's worth checking whether this improves performance in your case!
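
Here is a minimal sketch of that pattern (placeholder path and column names again; process() stands in for whatever you do with each chunk, and the chunked read can be combined with the truncating converter):

import pandas as pd

tgfile = "data.tsv"                                  # placeholder path
list_of_names = ["c1", "c2", "c3", "c4",
                 "c5", "c6", "c7", "motif"]          # placeholder column names

def truncate_motif(motif):
    # Keep at most 8 characters plus a '~' marker, as in the question.
    return motif[:8] + '~' if len(motif) > 8 else motif

def process(chunk):
    # Placeholder for whatever per-chunk work you need to do.
    print(len(chunk), "rows")

# With chunksize, read_csv returns an iterator, so only `chunksize` rows
# are held in memory at a time.
for chunk in pd.read_csv(tgfile,
                         sep="\t",
                         comment='#',
                         header=None,
                         names=list_of_names,
                         converters={'motif': truncate_motif},
                         chunksize=10000):
    process(chunk)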

Upvotes: 2
