Reputation: 5086
I have a CSV file which is ~40 GB and has 1,800,000 lines.
I want to randomly sample 10,000 lines and print them to a new file.
Right now, my approach is to use sed as:
(sed -n "$vars" < input.txt) > output.txt
where $vars is a randomly generated list of print commands for the lines I want (e.g. 1p;14p;1700p;...;10203p).
While this works, it takes about 5 minutes per execution. That's not a huge amount of time, but I was wondering if anybody had ideas on how to make it quicker?
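For reference, here is a minimal sketch of how such a list of print commands might be generated in Python (the names here are illustrative; the actual generation script is not shown):
import random

# Illustrative sketch only; the real script that builds $vars is not shown.
line_count = 1800000
sample_size = 10000
picks = sorted(random.sample(range(1, line_count + 1), sample_size))
sed_script = ';'.join('{}p'.format(n) for n in picks)  # becomes $vars, e.g. "1p;14p;..."
print(sed_script)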
Upvotes: 7
Views: 2509
Reputation: 114320
The biggest advantage to having lines of the same length is that you don't need to find newlines to know where each line starts. With a file size of ~40GB containing ~1.8M lines, you have a line length of ~20KB/line. If you sample 10K lines, you have ~4MB between the lines you read. This is almost certainly around three orders of magnitude larger than the size of a block on your disk, so seeking to the next read location is much more efficient than reading every byte in the file.
Seeking will also work with files that have unequal line lengths (e.g., because of non-ASCII characters in UTF-8 encoding), but requires minor modifications to the method. If you have unequal lines, you can seek to an estimated location, then scan to the start of the next line. This is still quite efficient, because you will be skipping ~4MB for every ~20KB you need to read. Your sampling uniformity will be compromised slightly, since you will be selecting byte locations instead of line locations, and you won't know for sure which line number you are reading.
You can implement this directly in the Python code that generates your line numbers. Here is a sample of how to deal with lines that all have the same number of bytes (usually ASCII encoding):
import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)
with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size

    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal.
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, the default seek offset is measured from the
        # start of the file, not from the current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking before consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index
If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You just generate random byte offsets, and skip to the next full line after each offset. The following implementation assumes that you know for a fact that no line is longer than 40KB. You would have to do something like this if your CSV had non-ASCII Unicode characters encoded in UTF-8, because even if the lines all contained the same number of characters, they would contain different numbers of bytes. In this case, you would have to open the file in binary mode, since otherwise you might run into decoding errors when you skip to a random byte, if that byte happens to fall mid-character:
import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars.
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)

# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)

with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Read out the rest of the current (partial) line
        file.readline()
        # Print the next full line. You don't know its line number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')
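One possible sketch of make_offsets, under the stated constraints (sorted, unique, at least max_line_bytes apart, within [0, file_size - margin)); the choice of a 2 * max_line_bytes end margin here is just one reasonable assumption:
import random

def make_offsets(selection_count, file_size, max_line_bytes):
    # Leave a margin at the end so that, after seeking to the last offset,
    # there is still room to skip a partial line and then read a full one.
    usable = file_size - 2 * max_line_bytes
    # Shrink the sampling range so that spreading the sorted picks apart
    # by max_line_bytes each still fits inside `usable`.
    spread = usable - selection_count * max_line_bytes
    if spread < selection_count:
        raise ValueError('File is too small for this many samples')
    picks = sorted(random.sample(range(spread), selection_count))
    # Adding i * max_line_bytes to the i-th sorted pick keeps the offsets
    # sorted and guarantees they are at least max_line_bytes apart.
    return [p + i * max_line_bytes for i, p in enumerate(picks)]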
All code here is Python 3.
Upvotes: 6
Reputation: 5975
In case all lines have the same length, you could do it without the need to parse the whole file or load it into memory, using dd.
You have to know the number of lines (having already executed wc -l) and the precise byte length of each line, and of course you have to test and ensure that all lines really have the same length. Even wc will be slow, as it reads the whole file.
For example, if every line is 20000 bytes:
#!/bin/bash
for i in `shuf -n 10000 -i 0-1799999 | sort -n`
do
    dd if=file bs=20000 skip="$i" count=1 of=output status=none \
       oflag=append conv=notrunc
done
This way we loop and run 10K dd processes. I am not sure whether it could be done in a single invocation, so although dd is fast, using a language like Python with its seek() method (as @tripleee says and @Mad Physicist hinted at in the comments) has the advantage of a single process.
#!/usr/bin/python3
import random

randoms = random.sample(range(0, 1800000), 10000)
randoms.sort()
lsize = 20000
with open("file", "rb") as infile, open('output', 'wb') as outfile:
    for n in randoms:
        infile.seek(lsize * n)
        outfile.write(infile.read(lsize))
To save some more seconds, if the output is small enough, you can keep it in a bytearray and write it out all at once at the end:
with open("file", "rb") as infile, open('output', 'wb') as outfile:
    buf = bytearray()
    for n in randoms:
        infile.seek(lsize * n)
        buf.extend(infile.read(lsize))
    outfile.write(buf)
Upvotes: 2
Reputation: 103844
For testing purposes, let's create a file of 1,800,000 lines:
$ awk 'BEGIN {for (i=1; i<=1800000; i++) print "line " i}' >file
$ ls -l file
-rw-r--r-- 1 dawg wheel 22288896 Jan 1 09:41 file
Assuming you don't know the number of lines in that file, the fastest way to get the total number of lines is with the POSIX utility wc:
$ time wc -l file
1800000 file
real 0m0.018s
user 0m0.012s
sys 0m0.004s
So getting the total line count of a text file with 1,800,000 lines is pretty fast.
Now that you know the total number of lines, you can use awk to print a random sample of those lines:
#!/bin/bash
lc=($(wc -l file))
awk -v lc="$lc" -v c=10000 '
BEGIN{srand()}
int(lc*rand())<=c{print; i++}
i>=c{exit}
' file >rand_lines
That runs in about 200 ms on my older iMac. Note that the total is close to 10,000, but likely a bit less, since you will often hit the end of the file before collecting 10,000 lines.
If you want exactly 10,000 lines, at a small penalty to true randomness, you can do:
awk -v lc="$lc" -v c=10000 '
BEGIN{srand()}
# 1.01 is a fudge factor to make sure 10,000 lines are hit before EOF
int(lc*rand())<c*1.01{print; i++}
i>=c{exit}
' file >rand_lines
Or, alternatively, generate 10,000 unique numbers between 1 and the number of lines:
awk -v lc="$lc" -v c=10000 '
BEGIN{srand()
    while (i<c) {
        x=int(lc*rand())+1     # +1 so the numbers match NR, which starts at 1
        if (x in rl) continue  # careful if c is larger than or close to lc
        rl[x]
        i++
    }
}
NR in rl' file >rand_lines
Upvotes: 1
Reputation: 189397
If indeed your lines are all the same length, your Python script can randomly seek() ahead in the file, and you know exactly which offset to seek to in order to land on the character right after a newline.
The Python script which generates the random indices for your sed script should be easy to adapt to this approach. Basically, when you generate 123p to feed into sed, instead seek to 122 * line length and read the line you land on.
A complication is that Python 3 prohibits random seeks in files which are opened in text mode (because it needs to know where encoded characters start and end). For a quick and dirty script, simply reading and writing bytes should be fine (in general the recommendation is to decode bytes into Unicode and then encode again before writing, but since you aren't processing the lines in Python at all, this is unnecessary).
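A minimal sketch of that adaptation in bytes mode (the line length and file names here are placeholders; reuse the index list your existing script already produces):
import random

# Placeholder values; measure line_length from your actual file and reuse
# the indices your existing script generates for sed.
line_length = 20000
indices = sorted(random.sample(range(1, 1800001), 10000))  # 1-based, like "123p"

with open('input.txt', 'rb') as src, open('output.txt', 'wb') as dst:
    for index in indices:
        # "123p" for sed becomes a seek to 122 * line_length here
        src.seek((index - 1) * line_length)
        dst.write(src.readline())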
Upvotes: 1
Reputation: 848
Yet another idea, borrowed from the world of Monte Carlo simulations, would be to loop over the lines and generate a random number in each iteration. If you want 10,000 lines out of 1,800,000, you'd reason as follows: there is a 10,000/1,800,000 (i.e. 1/180) chance that you want to include the row in question. If the random number is less than or equal to that probability, you accept the row; otherwise you reject it, or break out of the loop once the desired number of rows has been collected.
The downside of this approach is that there is no guarantee that exactly 10,000 lines are sampled. I am also suspicious that this approach has biases and that it will not be random enough.
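A minimal sketch of that acceptance loop, using the question's numbers (the file names are placeholders):
import random

# 10,000 out of 1,800,000 lines, i.e. an acceptance probability of 1/180
wanted = 10000
probability = 10000 / 1800000

with open('file.csv') as src, open('output.csv', 'w') as dst:
    taken = 0
    for line in src:
        if random.random() <= probability:
            dst.write(line)
            taken += 1
            if taken >= wanted:
                break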
Upvotes: 0
Reputation: 848
You will want to insert the data into a database (such as SQLite or MySQL), and then repeat your idea in SQL:
select * from your_table where id in (1, 14, 1700, ...)
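As a rough sketch of what that might look like with Python and SQLite (the file, table, and column names are placeholders, and a real schema would split the CSV into proper columns):
import random
import sqlite3

# One-time load; file, table and column names are placeholders.
conn = sqlite3.connect('sample.db')
conn.execute('CREATE TABLE IF NOT EXISTS your_table (line TEXT)')
with open('file.csv') as src:
    conn.executemany('INSERT INTO your_table (line) VALUES (?)',
                     ((line,) for line in src))
conn.commit()

# Sample 10,000 rowids and fetch them in one query.
row_count = conn.execute('SELECT COUNT(*) FROM your_table').fetchone()[0]
ids = random.sample(range(1, row_count + 1), 10000)  # rowid is 1-based
# The ids are integers we generated ourselves, so formatting them into the
# query is safe and sidesteps the limit on the number of bound parameters.
query = 'SELECT line FROM your_table WHERE rowid IN ({})'.format(
    ','.join(str(i) for i in ids))
rows = conn.execute(query).fetchall()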
You can also read up on how to select a random sample in this excellent tutorial: http://jan.kneschke.de/projects/mysql/order-by-rand/
There is no way to devise a shell script that would run significantly faster, as your code ultimately relies on the way filesystems fundamentally work: for good performance you want to access the disk sequentially and in chunks. Databases are designed to solve this issue by storing how the data is laid out on the hard drive in a separate file called the index. It works much in the same way as the index of a book.
This is a rich topic and requires some learning. If you're new to database programming, a 40 GB dataset is a good starting point, though.
Upvotes: 0