Nigu

Reputation: 2135

Randomly mix lines of 3 million-line file

Everything is in the title. I'm wondering if anyone knows a quick way, with reasonable memory demands, to randomly mix all the lines of a 3-million-line file. I guess it is not possible with a simple vim command, so any simple Python script would do. I tried in Python using a random number generator, but did not manage to find a simple way out.

Upvotes: 42

Views: 33169

Answers (11)

John Kugelman

Reputation: 361605

Takes only a few seconds in Python:

import random
with open('3mil.txt') as f:
    lines = f.readlines()
random.shuffle(lines)
with open('3mil.txt', 'w') as f:
    f.writelines(lines)

Upvotes: 63

Akib Sadmanee

Reputation: 159

This is not strictly needed for your problem; I'm leaving it here for people who arrive looking for a way to shuffle a much bigger file. It works for smaller files as well: change the 1GB chunk size passed to split to something smaller, e.g. 100MB, to produce many text files of 100MB each.

I had a 20GB file containing more than 1.5 billion sentences. Calling the shuf command in the Linux terminal simply overwhelmed both my 16GB of RAM and an equally sized swap area. This is a bash script I wrote to get the job done. It assumes that you keep the bash script in the same folder as your big text file. Note that lines are shuffled only within each chunk and the chunks keep their original order, so the result is not a uniform shuffle of the whole file.

#!/bin/bash

# Create a temporary folder named "splitted"
mkdir ./splitted

# Split the input file into multiple smaller files (at most 1GB each).
# -C keeps whole lines together, whereas -b could cut a line in two.
echo "Splitting big txt file..."
split -C 1GB ./your_big_file.txt ./splitted/file --additional-suffix=.txt
echo "Done."

# Shuffle each small file in place
echo "Shuffling splitted txt files..."
for entry in ./splitted/*.txt
do
  shuf "$entry" -o "$entry"
done
echo "Done."

# Concatenate the shuffled pieces into one big text file
echo "Concatenating shuffled txt files into 1 file..."
cat ./splitted/* > ./your_big_file_shuffled.txt
echo "Done."

# Delete the temporary "splitted" folder
rm -rf ./splitted
echo "Complete."
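
For readers who want the split-into-pieces idea to mix lines across the whole file, here is a minimal Python sketch of a two-pass variant: each line is sent to a randomly chosen bucket file, then each bucket is shuffled in memory and appended to the output. This is not the script above; the function name `bucket_shuffle` and the parameters `num_buckets` and `seed` are made up for illustration, and it assumes each bucket fits in RAM and that `num_buckets` stays below the OS limit on open files.

```python
import os
import random
import tempfile


def bucket_shuffle(src_path, dst_path, num_buckets=16, seed=None):
    """Shuffle the lines of src_path into dst_path via temporary bucket files."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, "bucket%04d.txt" % i)
                 for i in range(num_buckets)]
        buckets = [open(p, "w") for p in paths]
        try:
            # Pass 1: send every line to a uniformly chosen bucket.
            with open(src_path) as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # terminate a final line that lacks a newline
                    rng.choice(buckets).write(line)
        finally:
            for b in buckets:
                b.close()
        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(dst_path, "w") as dst:
            for p in paths:
                with open(p) as f:
                    lines = f.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)
```

Peak memory is roughly the size of the largest bucket, so a larger `num_buckets` trades more temporary files for less RAM.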

Upvotes: 1

Kumaresp

Reputation: 45

This will do the trick. My solution doesn't even use random, and it also removes duplicates. (It relies on the iteration order of a set, which is arbitrary rather than a uniform shuffle.)

import sys
with open(sys.argv[1]) as f:
    lines = list(set(f.readlines()))
# Each line keeps its own newline, so join on the empty string.
sys.stdout.write(''.join(lines))

In the shell:

python shuffler.py nameoffilestobeshuffled.txt > shuffled.txt

Upvotes: -3

builder-7000

Reputation: 7627

The following Vimscript can be used to move randomly chosen lines to random positions (it calls the external shuf command to pick line numbers):

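function! Random()
  " Number of random line moves to perform; raise it for larger buffers.
  let nswaps = 100
  let firstline = 1
  " Adjust to the last line of the range to shuffle.
  let lastline = 10
  let i = 0
  while i <= nswaps
    " Pick a random line number with shuf and delete that line...
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe line.'d'
    " ...then paste it below another randomly chosen line.
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe "normal! " . line . 'Gp'
    let i += 1
  endwhile
endfunction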

Select the function in visual mode, yank it, and type :@" to define it, then execute it with :call Random()

Upvotes: 0

Aziz Alto

Reputation: 20311

Here is another way using random.choice. This may provide some gradual memory relief as well, but with a worse Big-O (each lines.remove call is linear, so the whole loop is quadratic) :)

from random import choice

with open('data.txt', 'r') as r:
    lines = r.readlines()

with open('shuffled_data.txt', 'w') as w:
    while lines:
        l = choice(lines)
        lines.remove(l)
        w.write(l)
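
The quadratic cost comes from `lines.remove`, which scans and shifts the list on every call. A possible variant, sketched here under the assumption that the file still fits in memory, swaps the picked line into the last slot and pops it, so each step is constant time. The function name `shuffle_file` and its parameters are illustrative, not part of the answer above.

```python
import random


def shuffle_file(src_path, dst_path, rng=random):
    """Write the lines of src_path to dst_path in random order."""
    with open(src_path) as r:
        lines = r.readlines()
    with open(dst_path, "w") as w:
        while lines:
            i = rng.randrange(len(lines))
            # Swap the pick into the last slot so pop() removes it in O(1),
            # instead of the O(n) scan and shift done by list.remove().
            lines[i], lines[-1] = lines[-1], lines[i]
            w.write(lines.pop())
```

As in the original loop, the list shrinks as lines are written out, and every remaining line is equally likely to be picked at each step.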

Upvotes: 1

Drag0

Reputation: 8918

I just tried this on a file with 4.3M lines, and the fastest thing was the 'shuf' command on Linux. Use it like this:

shuf huge_file.txt -o shuffled_lines_huge_file.txt

It took 2-3 seconds to finish.

Upvotes: 31

S.Lott

Reputation: 391846

Here's another version

At the shell, use this.

python decorate.py | sort | python undecorate.py

decorate.py

import sys
import random
for line in sys.stdin:
    sys.stdout.write( "{0}|{1}".format( random.random(), line ) )

undecorate.py

import sys
for line in sys.stdin:
    _, _, data = line.partition("|")
    sys.stdout.write(data)

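The Python scripts use almost no memory, since they stream line by line; sort does the heavy lifting and can spill to temporary files.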

Upvotes: 2

S.Lott

Reputation: 391846

import random
with open('the_file','r') as source:
    data = [ (random.random(), line) for line in source ]
data.sort()
with open('another_file','w') as target:
    for _, line in data:
        target.write( line )

That should do it. 3 million lines will fit into most machines' memory unless the lines are HUGE (over 512 characters).
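
To put a rough number on that, here is a back-of-the-envelope sketch. The constant names are made up for illustration, and the per-object overhead is measured with `sys.getsizeof` on whatever interpreter runs it, so the second figure is only an approximation of what the list of (random number, line) tuples would need.

```python
import sys

NUM_LINES = 3_000_000
CHARS_PER_LINE = 512  # the "HUGE" threshold mentioned above

# Raw text only, assuming one byte per character.
raw_bytes = NUM_LINES * CHARS_PER_LINE
print("raw text: %.2f GB" % (raw_bytes / 1e9))  # raw text: 1.54 GB

# Each entry also carries a str object, a float and a 2-tuple;
# measure their sizes here rather than hard-coding them.
sample = "x" * CHARS_PER_LINE
per_line = (sys.getsizeof(sample)
            + sys.getsizeof(0.5)
            + sys.getsizeof((0.5, sample)))
print("approx. in memory: %.2f GB" % (NUM_LINES * per_line / 1e9))
```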

Upvotes: 38

Lennart Regebro

Reputation: 172239

If you do not want to load everything into memory and sort it there, you have to store the lines on disk while doing random sorting. That will be very slow.

Here is a very simple, stupid and slow version. Note that this may take a surprising amount of disk space, and it will be very slow. I ran it with 300,000 lines, and it took several minutes. 3 million lines could very well take an hour. So: Do it in memory. Really. It's not that big.

import os
import tempfile
import shutil
import random
tempdir = tempfile.mkdtemp()
print(tempdir)

files = []
# Split the lines:
with open('/tmp/sorted.txt', 'rt') as infile:
    counter = 0    
    for line in infile:
        outfilename = os.path.join(tempdir, '%09i.txt' % counter)
        with open(outfilename, 'wt') as outfile:
            outfile.write(line)
        counter += 1
        files.append(outfilename)

with open('/tmp/random.txt', 'wt') as outfile:
    while files:
        index = random.randint(0, len(files) - 1)
        filename = files.pop(index)
        with open(filename, 'rt') as infile:
            outfile.write(infile.read())

shutil.rmtree(tempdir)

Another version would be to store the lines in an SQLite database and pull them randomly from that database. That is probably going to be faster than this.
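
A minimal sketch of that SQLite idea, using Python's standard sqlite3 module. The function name `sqlite_shuffle`, the table name `lines`, and the `db_path` parameter are assumptions for illustration; `db_path` should point to a fresh database file, and no timing claims are made here.

```python
import sqlite3


def sqlite_shuffle(src_path, dst_path, db_path):
    """Shuffle lines by storing them in SQLite and reading them back in random order."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE lines (text TEXT)")
        with open(src_path) as src:
            # Stream the file into the table one row per line.
            conn.executemany("INSERT INTO lines VALUES (?)",
                             ((line,) for line in src))
        conn.commit()
        with open(dst_path, "w") as dst:
            # ORDER BY RANDOM() makes SQLite sort the rows on a random key.
            query = "SELECT text FROM lines ORDER BY RANDOM()"
            for (line,) in conn.execute(query):
                dst.write(line)
    finally:
        conn.close()
```

The Python side only ever holds one line at a time; the sorting work is left to SQLite.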

Upvotes: 2

sleepynate

Reputation: 8036

This is the same as Mr. Kugelman's, but using Vim's built-in Python interface (on builds with only Python 3 support, use :py3 instead of :py):

:py import vim, random as r; cb = vim.current.buffer ; l = cb[:] ; r.shuffle(l) ; cb[:] = l

Upvotes: 2

fuzzyTew

Reputation: 3768

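On many systems the sort shell command takes -R to randomize its input. With GNU sort, note that -R sorts by a random hash of each key, so identical lines end up next to each other.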

Upvotes: 3
