Raymond
Raymond

Reputation: 103

Read only one random row from CSV file and move to another CSV

I'm facing a problem in reading random rows from a large csv file and moving it to another CSV file using 0.18.1 pandas and 2.7.10 Python on Windows.

I want to load only the randomly selected rows into the memory and move them to another CSV. I don't want to load the entire content of first CSV into memory.

This is the code I used:

import random

file_size = 100
f = open("customers.csv",'r')
o = open("train_select.csv", 'w')
for i in range(0, 50):
    offset = random.randrange(file_size)
    f.seek(offset)
    f.readline()
    random_line = f.readline()
    o.write(random_line)

The current output looks something like this:

2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street;   96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA

My problems are 2 fold:

  1. I want to see the header also in the second csv and not just the rows.

  2. A row should be selected by random function only once.

The output should be something like this:

id;name;firstname;zip;city;birthdate;street;housenr;stateCode;state
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street;   96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA

Upvotes: 2

Views: 779

Answers (2)

Jean-François Fabre
Jean-François Fabre

Reputation: 140256

At OP request, and since my 2 previous implementations had to read the input file, here's a more complex implementation where the file is not read in advance.

It uses bisect to store the couples of offsets of the lines, and a minimum line len (to be configured) in order to avoid that the random list is too long for nothing.

Basically, the program generates randomly ordered offsets ranging from offset of the second line (title line skipped) to the end of file, by step of minimum_line_len.

For each offset, it checks if line has not already been read (using bisect, which is fast but further testing is complex because of the corner cases). - if not read, skip back to find previous linefeed (that is reading the file, can't do otherwise) write it in the output file, store the start/end offsets in the couple list - if already read, skip

the code:

import random,os,bisect

input_file = "csv2.csv"

input_size = os.path.getsize(input_file)

smallest_line_len = 4

line_offsets = []

# just read the title
with open(input_file,'r') as f, open("train_select.csv", 'w') as fw:
    # read title and write it back
    title = f.readline()
    fw.write(title)

    # generate offset list, starting from current pos to the end of file
    # with a step of min line len to avoid generating too many numbers
    # (this can be 1 but that will take a while)
    offset_list = list(range(f.tell(),input_size,smallest_line_len))
    # shuffle the list at random
    random.shuffle(offset_list)

    # now loop through the offsets
    for offset in offset_list:
        # look if the offset is already contained in the list of sorted tuples
        insertion_point = bisect.bisect(line_offsets,(offset,0))

        if len(line_offsets)>0 and insertion_point == len(line_offsets) and line_offsets[-1][1]>offset:
            # bisect tells to insert at the end: check if within last couple boundary: if so, already processed
            continue
        elif insertion_point < len(line_offsets) and (offset==line_offsets[insertion_point][0] or
               (0 < insertion_point and line_offsets[insertion_point-1][0]<=offset<=line_offsets[insertion_point-1][1])):
            # offset is already known, line has already been processed: skip
            continue
        else:
            # offset is not known: rewind until we meet an end of line
            f.seek(offset)

            while True:
                c=f.read(1)
                if c=="\n":
                    # we found the line terminator of the previous line: OK
                    break
                offset -= 1
                f.seek(offset)
            # now store the current position: start of the current line
            line_start = offset+1
            # now read the line fully
            line = f.readline()
            # now compute line end (approx..)
            line_end = f.tell() - 1
            # and insert the "line" in the sorted list
            line_offsets.insert(insertion_point,(line_start,line_end))
            fw.write(line)
  • if

Upvotes: 1

Jean-Fran&#231;ois Fabre
Jean-Fran&#231;ois Fabre

Reputation: 140256

You have do simpler than that:

  1. first, read the customers file fully, title is a special case, keep it out.
  2. shuffle the list of lines (that's what you were looking for)
  3. write back title + shuffled lines

code:

import random

with open("customers.csv",'r') as f:
    title = f.readline()
    lines = f.readlines()

random.shuffle(lines)

with open("train_select.csv", 'w') as f:
    f.write(title)
    f.writelines(lines)

EDIT: if you don't want to hold the whole file in memory, here's an alternative. The only drawback is that you have to read the file once (but not store in memory) to compute line offsets:

import random

input_file = "customers.csv"

line_offsets = list()

# just read the title
with open(input_file,'r') as f:
    title = f.readline()
    # store offset of the first
    while True:
        # store offset of the next line start
        line_offsets.append(f.tell())
        line = f.readline()
        if line=="":
            break

    # now shuffle the offsets
    random.shuffle(line_offsets)

    # and write the output file
    with open("train_select.csv", 'w') as fw:
        fw.write(title)
        for offset in line_offsets:
            # seek to a line start
            f.seek(offset)
            fw.write(f.readline())

Upvotes: 1

Related Questions