Reputation: 8783

Sort a csv in shell and keep only the top values

I have a csv looking like this:

user, score
bob, 10
alice, 12
peter, 20
bruce, 32
...

The file is quite big and I'd like to get the top 50 users with the highest score.

How can I do this with a simple shell command?

Upvotes: 0

Answers (2)

jagprog5

Reputation: 131

The currently accepted answer is:

sort -t',' -k2 -nr file | head -50

This uses a ton of memory; gnu sort reads the file, then sorts it, then sends it to the output. It could do things differently if it had downstream context, like dropping lines that are no longer needs.

That's where choose comes in (I'm the author). Here's the equivalent command:

cat file | tail -n +2 | choose --out=50 --sort-reverse -n --field '^[^,]*+. \K.*+'

Explanation:

cat file: read the file
tail -n +2: remove the header line (accepted answer doesn't do this but it's needed for correctness)
--out=50: limit the output to the first 50 lines
--sort-reverse -n: sort in reverse numeric order
--field '^[^,]*+. \K.*+': match the number on each line. the expression is tailored for the question's specific file format. Other examples include:
- ^[^,]*+.\K[^,]*+ match the second field of a csv
- ^(?>(?:[^,]*+.){N})\K[^,]*+ match the Nth field (replace N)

From now on, the header is ommited, but tail really should be used to get rid of the header line.

For the next benchmarks, let's generate a test file:

# generate 10 million lines, 114M size 
(for i in {1..10000000} ; do
    #       V deliberate space, following original question
    echo hi, $i
done) > file
cat file | shuf -o file

Measurements are done with the time command:

$ >choose_out.csv <file command time --verbose -- choose --sort-reverse -n --out=50 --field '^[^,]*+. \K.*+'
    Command being timed: "choose --sort-reverse -n --out=50 --field ^[^,]*+. \K.*+"
    User time (seconds): 12.06
    System time (seconds): 0.06
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.13
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4480
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 193
    Voluntary context switches: 1
    Involuntary context switches: 73
    Swaps: 0
    File system inputs: 0
    File system outputs: 8
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

$ command time --verbose -- sort -t',' -k2 -nr file | head -50 >sort_out.csv
Command terminated by signal 13
    Command being timed: "sort -t, -k2 -nr file"
    User time (seconds): 48.07
    System time (seconds): 1.46
    Percent of CPU this job got: 332%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.90
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 675968
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 168596
    Voluntary context switches: 16
    Involuntary context switches: 3528
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

cmp -s choose_out.csv sort_out.csv && echo "same" || echo "different" # same

	choose	sort
Elapsed (wall clock) time	0:12.13	0:14.90
User time	12.06	48.07
Maximum resident set size	4480	675968

For this test case choose is faster in elapsed time, faster in CPU time, and uses way less memory.

The elapsed time comparison will vary based on hardware, but the general idea of using less CPU and less memory is constant.

Upvotes: 0

fedorqui

Reputation: 289775

You probably want to say:

sort -t',' -k2 -nr file

For your given input it returns:

bruce, 32
peter, 20
alice, 12
bob, 10
user, score

This uses sort with these parameters:

-t',' to set the field separator to the comma.
-k2 to indicate that sort has to work with the 2nd column.
-nr to indicate that you want to sort numerically and in reverse (first the highest).

To get just the top 50, you can pipe to head -50, which will get the first 50 lines of the output of the previous command:

sort -t',' -k2 -nr file | head -50

Upvotes: 4

Sort a csv in shell and keep only the top values

Answers (2)

Related Questions