Reputation: 528
I have an issue which I think I somewhat solved but I would like to learn more about it or learn about better solutions.
The problem: I have tab separated files with ~600k lines (and one comment line), of which one field (out of 8 fields) contains a string of variable length, anything between 1 and ~2000 characters.
Reading that file with the following function is terribly slow:
df = pd.read_csv(tgfile,
sep="\t",
comment='#',
header=None,
names=list_of_names)
However, perhaps I don't care so much about most of the string (field name of this string is 'motif') and I'm okay with truncating it if it's too long using:
def truncate_motif(motif):
if len(motif) > 8:
return motif[:8] + '~'
else:
return motif
df = pd.read_csv(tgfile,
sep="\t",
comment='#',
header=None,
converters={'motif': truncate_motif},
names=list_of_names)
This suddenly is lots faster.
So my questions are:
Upvotes: 4
Views: 1275
Reputation: 599
then this file's size is: 600 000 * 1000B ~ 570 MB. It's a lot, especially if you don't have much RAM memory.
It helps, because suddenly the average size of one line is not 1000B, but ca. 6-7B (considering new max = 8B). The system does not read and keep whole strings, but only checks their length and cut if it's needed. Sounds logic to me!
In such cases, when you have to load lots of data, it's good to use chunks.
for chunk in pd.read_csv(tgfile, chunksize=10000):
process(chunk)
The chunksize
parameter says how many rows contains one chunk. It's good to check if it improves performance in your case!
Upvotes: 2