Reputation: 103
I'm facing a problem in reading random rows from a large csv file and moving it to another CSV file using 0.18.1 pandas and 2.7.10 Python on Windows.
I want to load only the randomly selected rows into the memory and move them to another CSV. I don't want to load the entire content of first CSV into memory.
This is the code I used:
import random
file_size = 100
f = open("customers.csv",'r')
o = open("train_select.csv", 'w')
for i in range(0, 50):
offset = random.randrange(file_size)
f.seek(offset)
f.readline()
random_line = f.readline()
o.write(random_line)
The current output looks something like this:
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street; 96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
My problems are 2 fold:
I want to see the header also in the second csv and not just the rows.
A row should be selected by random function only once.
The output should be something like this:
id;name;firstname;zip;city;birthdate;street;housenr;stateCode;state
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street; 96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
Upvotes: 2
Views: 779
Reputation: 140256
At OP request, and since my 2 previous implementations had to read the input file, here's a more complex implementation where the file is not read in advance.
It uses bisect
to store the couples of offsets of the lines, and a minimum line len (to be configured) in order to avoid that the random list is too long for nothing.
Basically, the program generates randomly ordered offsets ranging from offset of the second line (title line skipped) to the end of file, by step of minimum_line_len.
For each offset, it checks if line has not already been read (using bisect
, which is fast but further testing is complex because of the corner cases).
- if not read, skip back to find previous linefeed (that is reading the file, can't do otherwise) write it in the output file, store the start/end offsets in the couple list
- if already read, skip
the code:
import random,os,bisect
input_file = "csv2.csv"
input_size = os.path.getsize(input_file)
smallest_line_len = 4
line_offsets = []
# just read the title
with open(input_file,'r') as f, open("train_select.csv", 'w') as fw:
# read title and write it back
title = f.readline()
fw.write(title)
# generate offset list, starting from current pos to the end of file
# with a step of min line len to avoid generating too many numbers
# (this can be 1 but that will take a while)
offset_list = list(range(f.tell(),input_size,smallest_line_len))
# shuffle the list at random
random.shuffle(offset_list)
# now loop through the offsets
for offset in offset_list:
# look if the offset is already contained in the list of sorted tuples
insertion_point = bisect.bisect(line_offsets,(offset,0))
if len(line_offsets)>0 and insertion_point == len(line_offsets) and line_offsets[-1][1]>offset:
# bisect tells to insert at the end: check if within last couple boundary: if so, already processed
continue
elif insertion_point < len(line_offsets) and (offset==line_offsets[insertion_point][0] or
(0 < insertion_point and line_offsets[insertion_point-1][0]<=offset<=line_offsets[insertion_point-1][1])):
# offset is already known, line has already been processed: skip
continue
else:
# offset is not known: rewind until we meet an end of line
f.seek(offset)
while True:
c=f.read(1)
if c=="\n":
# we found the line terminator of the previous line: OK
break
offset -= 1
f.seek(offset)
# now store the current position: start of the current line
line_start = offset+1
# now read the line fully
line = f.readline()
# now compute line end (approx..)
line_end = f.tell() - 1
# and insert the "line" in the sorted list
line_offsets.insert(insertion_point,(line_start,line_end))
fw.write(line)
Upvotes: 1
Reputation: 140256
You have do simpler than that:
code:
import random
with open("customers.csv",'r') as f:
title = f.readline()
lines = f.readlines()
random.shuffle(lines)
with open("train_select.csv", 'w') as f:
f.write(title)
f.writelines(lines)
EDIT: if you don't want to hold the whole file in memory, here's an alternative. The only drawback is that you have to read the file once (but not store in memory) to compute line offsets:
import random
input_file = "customers.csv"
line_offsets = list()
# just read the title
with open(input_file,'r') as f:
title = f.readline()
# store offset of the first
while True:
# store offset of the next line start
line_offsets.append(f.tell())
line = f.readline()
if line=="":
break
# now shuffle the offsets
random.shuffle(line_offsets)
# and write the output file
with open("train_select.csv", 'w') as fw:
fw.write(title)
for offset in line_offsets:
# seek to a line start
f.seek(offset)
fw.write(f.readline())
Upvotes: 1