Reputation: 8783
I have a csv looking like this:
user, score
bob, 10
alice, 12
peter, 20
bruce, 32
...
The file is quite big and I'd like to get the top 50 users with the highest score.
How can I do this with a simple shell command?
Upvotes: 0
Views: 398
Reputation: 131
The currently accepted answer is:
sort -t',' -k2 -nr file | head -50
This uses a ton of memory; gnu sort reads the file, then sorts it, then sends it to the output. It could do things differently if it had downstream context, like dropping lines that are no longer needs.
That's where choose comes in (I'm the author). Here's the equivalent command:
cat file | tail -n +2 | choose --out=50 --sort-reverse -n --field '^[^,]*+. \K.*+'
Explanation:
cat file
: read the file
tail -n +2
: remove the header line (accepted answer doesn't do this but it's needed for correctness)
--out=50
: limit the output to the first 50 lines
--sort-reverse -n
: sort in reverse numeric order
--field '^[^,]*+. \K.*+'
: match the number on each line. the expression is tailored for the question's specific file format. Other examples include:
^[^,]*+.\K[^,]*+
match the second field of a csv^(?>(?:[^,]*+.){N})\K[^,]*+
match the Nth field (replace N)From now on, the header is ommited, but tail
really should be used to get rid of the header line.
For the next benchmarks, let's generate a test file:
# generate 10 million lines, 114M size
(for i in {1..10000000} ; do
# V deliberate space, following original question
echo hi, $i
done) > file
cat file | shuf -o file
Measurements are done with the time
command:
$ >choose_out.csv <file command time --verbose -- choose --sort-reverse -n --out=50 --field '^[^,]*+. \K.*+'
Command being timed: "choose --sort-reverse -n --out=50 --field ^[^,]*+. \K.*+"
User time (seconds): 12.06
System time (seconds): 0.06
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.13
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4480
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 193
Voluntary context switches: 1
Involuntary context switches: 73
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ command time --verbose -- sort -t',' -k2 -nr file | head -50 >sort_out.csv
Command terminated by signal 13
Command being timed: "sort -t, -k2 -nr file"
User time (seconds): 48.07
System time (seconds): 1.46
Percent of CPU this job got: 332%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.90
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 675968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 168596
Voluntary context switches: 16
Involuntary context switches: 3528
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
cmp -s choose_out.csv sort_out.csv && echo "same" || echo "different" # same
choose | sort | |
---|---|---|
Elapsed (wall clock) time | 0:12.13 | 0:14.90 |
User time | 12.06 | 48.07 |
Maximum resident set size | 4480 | 675968 |
For this test case choose is faster in elapsed time, faster in CPU time, and uses way less memory.
The elapsed time comparison will vary based on hardware, but the general idea of using less CPU and less memory is constant.
Upvotes: 0
Reputation: 289775
You probably want to say:
sort -t',' -k2 -nr file
For your given input it returns:
bruce, 32
peter, 20
alice, 12
bob, 10
user, score
This uses sort
with these parameters:
-t','
to set the field separator to the comma.-k2
to indicate that sort
has to work with the 2nd column.-nr
to indicate that you want to sort numerically and in r
everse (first the highest).To get just the top 50, you can pipe to head -50
, which will get the first 50 lines of the output of the previous command:
sort -t',' -k2 -nr file | head -50
Upvotes: 4