mrnovice

Reputation: 253

How can I shuffle the rows of a large csv file and write the result to a new csv file without using too much memory?

So if I have a csv file as follows:

User  Gender
 A    M
 B    F
 C    F

Then I want to write another csv file with rows shuffled like so (as an example):

User  Gender
 C    F
 A    M
 B    F

My problem is that I don't know how to randomly select rows while ensuring that I get every row from the original csv file exactly once. For reference, my csv file is around 3 GB. If I load the entire dataset into a dataframe and use the random package to shuffle it, my PC crashes from running out of RAM.

Upvotes: 2

Views: 2970

Answers (3)

PascalVKooten

Reputation: 21451

Probably the easiest (and fastest) is to use shuf in bash!

shuf words.txt > shuffled_words.txt

(I know you asked for a Python solution, but I am going to assume this is still a better answer)

To programmatically do it from Python:

import sh
sh.shuf("words.txt", _out="shuffled_words.txt")

(The sh library redirects output through the special _out keyword; a plain out= keyword would be passed to shuf as an --out option, which it doesn't accept.)
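If you'd rather not depend on the third-party sh package, the standard library's subprocess module can drive shuf the same way. A minimal sketch (shuf_file is an illustrative name, and it assumes GNU shuf is on your PATH):

```python
import subprocess

def shuf_file(src, dst):
    # Invoke GNU shuf on src and redirect its stdout into dst.
    # check=True raises CalledProcessError if shuf exits non-zero.
    with open(dst, "wb") as out:
        subprocess.run(["shuf", src], stdout=out, check=True)
```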

Upvotes: 2

Cyriaque Pin

Reputation: 108

You can use the chunksize argument to read the csv in chunks (note the parameter is chunksize, not chunk_size):

df_chunks = pandas.read_csv("your_csv_name.csv", chunksize=10)

Then you can shuffle each chunk on its own so it takes less memory (keep in mind this only shuffles rows within each chunk, not across chunks):

new_chunks = []
for chunk in df_chunks:
    new_chunks.append(chunk.sample(frac=1))

Then you can concat them and save the result into another csv:

new_df = pandas.concat(new_chunks)
new_df.to_csv("your_new_csv_name.csv", index=False)

If you have memory issues while you build new_chunks, don't forget to release the old chunks, as you don't want them left in RAM for no reason; you can do that with chunk = None.

Upvotes: 0

Joop Eggen

Reputation: 109593

  • Create an array of the file positions where each line starts, by reading the file once with random access or as a memory-mapped file. Give the array one extra entry holding the file length, so that line i occupies the bytes [array[i], array[i+1]).
  • Shuffle the indices 0 .. number of lines - 1.
  • Now you can use random-access positioning (seek) to read each line into a buffer and write it out in shuffled order.
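The steps above can be sketched in Python like so. This is a minimal sketch, not the answer's own code; shuffle_lines and its parameters are illustrative names. Only the offset array is held in memory, so a 3 GB file costs roughly one integer per line rather than the whole file:

```python
import random

def shuffle_lines(src_path, dst_path, keep_header=True):
    # Pass 1: record the byte offset of each line start, plus one
    # extra entry with the file length, so line i spans
    # offsets[i] .. offsets[i + 1].
    offsets = []
    with open(src_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
        offsets.append(pos)  # extra entry: total file length

    indices = list(range(len(offsets) - 1))
    header = None
    if keep_header and indices:
        header = indices.pop(0)  # keep the CSV header row first
    random.shuffle(indices)

    # Pass 2: seek to each line start and copy it to the output.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        if header is not None:
            src.seek(offsets[header])
            dst.write(src.read(offsets[header + 1] - offsets[header]))
        for i in indices:
            src.seek(offsets[i])
            dst.write(src.read(offsets[i + 1] - offsets[i]))
```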

Upvotes: 0
