Reputation: 253
So if I have a csv file as follows:
User Gender
A M
B F
C F
Then I want to write another csv file with rows shuffled like so (as an example):
User Gender
C F
A M
B F
My problem is that I don't know how to randomly select rows while ensuring that I get every row from the original csv file exactly once. For reference, my csv file is around 3 GB. If I load the entire dataset into a dataframe and shuffle it with the random package, my PC crashes due to RAM use.
Upvotes: 2
Views: 2970
Reputation: 21451
Probably the easiest (and fastest) is to use shuf in bash!
shuf words.txt > shuffled_words.txt
(I know you asked for a Python solution, but I am going to assume this is still a better answer)
To programmatically do it from Python:
import sh
sh.shuf("words.txt", _out="shuffled_words.txt")  # _out redirects stdout to the file
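One caveat: shuf shuffles every line, so a csv header row gets mixed in with the data. If you want to stay in Python, here is a minimal sketch that keeps the header first and holds only the byte offset of each line in memory, not the rows themselves. It assumes every record sits on a single line (no newlines inside quoted fields); the function name shuffle_csv_lines and its seed parameter are made up for illustration.

```python
import random

def shuffle_csv_lines(src, dst, seed=None):
    """Copy src to dst with data rows in random order, header kept first."""
    rng = random.Random(seed)
    with open(src, "rb") as fin:
        header = fin.readline()
        # Pass 1: remember where each non-blank data line starts.
        offsets = []
        while True:
            pos = fin.tell()
            line = fin.readline()
            if not line:
                break
            if line.strip():
                offsets.append(pos)
        rng.shuffle(offsets)
        # Pass 2: seek to each offset in shuffled order and copy the line.
        with open(dst, "wb") as fout:
            fout.write(header)
            for pos in offsets:
                fin.seek(pos)
                line = fin.readline()
                if not line.endswith(b"\n"):
                    line += b"\n"  # the last line of the file may lack a newline
                fout.write(line)

# Example call (file names are placeholders):
# shuffle_csv_lines("your_csv_name.csv", "your_new_csv_name.csv")
```

Every row is written exactly once because each offset appears once in the list. The random seeks can be slow on a spinning disk, so shuf may still be the faster option.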
Upvotes: 2
Reputation: 108
You can use the chunksize argument of read_csv to read the csv in chunks
import pandas
df_chunks = pandas.read_csv("your_csv_name.csv", chunksize=10)
Then you can shuffle each chunk on its own, so it takes less memory
new_chunks = []
for chunk in df_chunks:
    new_chunks.append(chunk.sample(frac=1))  # shuffle the rows within this chunk
Then you can concat them and save the result into another csv:
new_df = pandas.concat(new_chunks)
new_df.to_csv("your_new_csv_name.csv", index=False)  # index=False avoids writing an extra index column
If you have memory issues, don't forget to release each old chunk while you build new_chunks, as you don't want it left in RAM for no reason. You can do that with
chunk = None
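Keep in mind that shuffling inside each chunk does not move rows between chunks, and concatenating all the chunks puts the whole file back in RAM. As a rough sketch of a different technique (a two-pass bucket shuffle, using only the standard library instead of pandas): scatter each row into a randomly chosen temporary file, then shuffle each temporary file in memory and append it to the output. The name bucket_shuffle and the n_buckets and seed parameters are invented for this example, and it assumes one record per line and that each bucket (roughly file size / n_buckets) fits in RAM.

```python
import os
import random
import tempfile

def bucket_shuffle(src, dst, n_buckets=16, seed=None):
    """Shuffle the data rows of src into dst using temporary bucket files."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}") for i in range(n_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            with open(src, "rb") as fin:
                header = fin.readline()
                # Pass 1: send every row to a randomly chosen bucket file.
                for line in fin:
                    if not line.strip():
                        continue  # skip blank lines
                    if not line.endswith(b"\n"):
                        line += b"\n"  # last line may lack a newline
                    rng.choice(buckets).write(line)
        finally:
            for b in buckets:
                b.close()
        # Pass 2: load one bucket at a time, shuffle it, append to the output.
        with open(dst, "wb") as fout:
            fout.write(header)
            for p in paths:
                with open(p, "rb") as f:
                    rows = f.readlines()
                rng.shuffle(rows)
                fout.writelines(rows)

# Example call (file names are placeholders):
# bucket_shuffle("your_csv_name.csv", "your_new_csv_name.csv", n_buckets=32)
```

Peak memory is about the size of the largest bucket, so raising n_buckets lowers RAM use at the cost of more temporary files open at once.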
Upvotes: 0
Reputation: 109593
[array[i], array[i+1]>
Upvotes: 0