Wouter De Coster

Reputation: 528

pandas.read_csv slow when reading file with variable length string

I have an issue which I think I've somewhat solved, but I would like to learn more about it and hear about better solutions.

The problem: I have tab-separated files with ~600k lines (plus one comment line), and one of the 8 fields contains a string of variable length, anywhere between 1 and ~2000 characters.

Reading that file with the following function is terribly slow:

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)

However, I don't really care about most of that string (the field is named 'motif'), and I'm okay with truncating it if it's too long, using:

def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)

This is suddenly a lot faster.

So my questions are:

  1. Why is reading this file so slow? Does it have to do with allocating memory?
  2. Why does the converter function help here? It has to call an additional function for every row, yet it is still a lot faster...
  3. What else can be done?

Upvotes: 4

Views: 1275

Answers (1)

Rafaó

Reputation: 599

  1. You didn't mention what "slow" means to you, but if:
    • your file contains ca. 600k rows,
    • each row contains 1 to ~2000 characters (say 1,000 on average, so roughly 1,000 B per line),

then the file's size is about 600,000 * 1,000 B ≈ 600 MB (≈570 MiB). That's a lot to parse and hold in memory, especially if you don't have much RAM.
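
As a rough sanity check (a minimal sketch, not your exact setup: the path and column names below are placeholders, so substitute your own tgfile and list_of_names), you can measure how much of the parsed DataFrame's memory actually goes to the 'motif' strings:

import pandas as pd

tgfile = "data.tsv"                                  # placeholder path
list_of_names = ["c1", "c2", "c3", "c4",
                 "c5", "c6", "c7", "motif"]          # placeholder column names

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)

# deep=True also counts the Python string objects; the variable-length
# 'motif' column will usually dominate here.
print(df.memory_usage(deep=True))
print(df['motif'].str.len().describe())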

  2. It helps because the average 'motif' value kept in memory is no longer ~1,000 B but only a handful of bytes (the truncated string is at most 9 characters: 8 plus the '~' marker). pandas still reads each string while parsing, but it only keeps the truncated copy, so far less memory has to be allocated and held for that column.
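
If you want to verify that on your own data, a quick timing comparison could look like this (again just a sketch with placeholder names):

import time
import pandas as pd

tgfile = "data.tsv"                                  # placeholder path
list_of_names = ["c1", "c2", "c3", "c4",
                 "c5", "c6", "c7", "motif"]          # placeholder column names

def truncate_motif(motif):
    # Same truncation as in the question: keep at most 8 characters plus '~'.
    return motif[:8] + '~' if len(motif) > 8 else motif

def timed_read(**kwargs):
    start = time.perf_counter()
    df = pd.read_csv(tgfile, sep="\t", comment='#',
                     header=None, names=list_of_names, **kwargs)
    return df, time.perf_counter() - start

_, t_full = timed_read()
_, t_truncated = timed_read(converters={'motif': truncate_motif})
print(f"full strings: {t_full:.2f}s, truncated: {t_truncated:.2f}s")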

  3. In cases like this, when you have to load a lot of data, it's good to read the file in chunks:

for chunk in pd.read_csv(tgfile, chunksize=10000):
    process(chunk)

The chunksize parameter specifies how many rows each chunk contains. It's worth checking whether this improves performance in your case!
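
Here is a minimal sketch of that pattern (placeholder path and column names again; process() stands in for whatever you do with each chunk, and the chunked read can be combined with the truncating converter):

import pandas as pd

tgfile = "data.tsv"                                  # placeholder path
list_of_names = ["c1", "c2", "c3", "c4",
                 "c5", "c6", "c7", "motif"]          # placeholder column names

def truncate_motif(motif):
    # Keep at most 8 characters plus a '~' marker, as in the question.
    return motif[:8] + '~' if len(motif) > 8 else motif

def process(chunk):
    # Placeholder for whatever per-chunk work you need to do.
    print(len(chunk), "rows")

# With chunksize, read_csv returns an iterator, so only `chunksize` rows
# are held in memory at a time.
for chunk in pd.read_csv(tgfile,
                         sep="\t",
                         comment='#',
                         header=None,
                         names=list_of_names,
                         converters={'motif': truncate_motif},
                         chunksize=10000):
    process(chunk)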

Upvotes: 2
