khemedi

Reputation: 806

How to shuffle very large .csv files with headers in python?

Based on this post, using shuf is the fastest way:

import sh
sh.shuf("words.txt", out="shuffled_words.txt")

However, this code shuffles the header as well. My file has a header row, and I don't want it shuffled along with the data.

Upvotes: 2

Views: 490

Answers (1)

DYZ

Reputation: 57085

Copy the content of the file into another file without the header:

# Copy every line except the first (the header) into an interim file.
with open("words.txt") as infile, open("words-nohead.txt", "w") as outfile:
    next(infile, None)  # skip the header line
    for line in infile:
        outfile.write(line)

Then shuffle the headerless file. Finally, write the first line of the original file, followed by the shuffled headerless file, into shuffled_words.txt (I think you can use sh.cat() for this), and remove the interim files.
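If you would rather stay in Python and avoid the interim files, the same steps can be sketched roughly as follows. This is only a minimal sketch: it swaps shuf for random.shuffle, the function name shuffle_keep_header and its seed parameter are made up for illustration, and it assumes that all data rows fit in memory and that no quoted CSV field contains a line break.

```python
import random


def shuffle_keep_header(src, dst, seed=None):
    """Copy src to dst, keeping the first line in place and shuffling the rest.

    Hypothetical helper for illustration; not part of the sh package.
    """
    rng = random.Random(seed)  # seed is optional, only for reproducible runs

    # newline="" keeps the original line endings untouched.
    with open(src, newline="") as infile:
        header = infile.readline()
        rows = infile.readlines()  # assumption: the data rows fit in memory

    # Make sure the last row ends with a newline, otherwise it could be
    # glued to the following row once the order changes.
    if rows and not rows[-1].endswith("\n"):
        rows[-1] += "\n"

    rng.shuffle(rows)

    with open(dst, "w", newline="") as outfile:
        outfile.write(header)
        outfile.writelines(rows)
```

It would be called as shuffle_keep_header("words.txt", "shuffled_words.txt"). For very large files this is likely slower than shuf, so treat it as a convenience rather than a replacement.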

Actually, you do not need Python for this. Bash alone suffices:

head -n 1 words.txt > shuffled_words.txt
tail -n+2 words.txt | shuf >> shuffled_words.txt

Bear in mind that shuf reads the whole file into memory anyway, so you must have enough memory to hold the file.

Upvotes: 2
