Reputation: 4062
I have a huge file (millions of lines) and I want to take a random sample from it. I've generated a list of unique random numbers, and now I want to extract every line whose line number matches one of the numbers I generated.
Sorting the random numbers is not a problem, so I was thinking I could take the difference between consecutive numbers and jump ahead by that difference in the file.
I think I should use sed or awk.
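One way to sketch that idea without doing the cursor arithmetic by hand is to turn the list of numbers into a sed script, one Np print command per wanted line (the file names nums.txt, lines.sed and huge_file are just placeholders):
sort -n random_numbers > nums.txt   # line numbers, one per line
sed 's/$/p/' nums.txt > lines.sed   # turn each number N into the sed command "Np"
sed -n -f lines.sed huge_file > sample.txt
sed checks every address against every input line, so the sorting is not even strictly required for correctness.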
Upvotes: 0
Views: 108
Reputation: 5972
You can use awk and shuf:
shuf file.txt > shuf.txt              # shuffle all lines of the file
awk '!a[$0]++' shuf.txt > uniqed.txt  # keep only the first occurrence of each line
The awk one-liner !a[$0]++ prints a line only the first time it is seen, which makes it a handy tool for removing duplicates without sorting.
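The two steps can also be chained into a single pipeline, so the intermediate shuf.txt is not needed:
shuf file.txt | awk '!a[$0]++' > uniqed.txt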
Upvotes: 0
Reputation: 289745
Why don't you directly use shuf to get random lines:
shuf -n NUMBER_OF_LINES file
$ seq 100 >a # the file "a" contains the numbers 1 to 100, one per line
$ shuf -n 4 a
54
46
30
53
$ shuf -n 4 a
50
37
63
21
Can I somehow store the line numbers shuf chose? – Pio
As I did in How to efficiently get 10% of random lines out of the large file in Linux?, you can do something like this:
shuf -i 1-100 -n 5 > rand_numbers # store the list of line numbers
awk 'FNR==NR {lines[$1]; next} FNR in lines' rand_numbers a # print those lines from the file "a"
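For an arbitrary file you can count the lines first instead of hard-coding the range; a sketch, assuming GNU shuf (here 5 is the sample size and "file" is a placeholder name):
shuf -i 1-"$(wc -l < file)" -n 5 | awk 'FNR==NR {lines[$1]; next} FNR in lines' - file
awk reads the chosen line numbers from stdin (the - argument), then prints the matching lines of file in their original order.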
Upvotes: 4