Darth Vader

Reputation: 77

Sort and count number of occurrences of lines

I have a 35 GB file with various strings, for example:

test1
test2
test1
test34!
test56
test56
test896&
test1
test4
etc
...

There are several billion lines.

I want to sort them and count occurrences, but after running for 2 days it still hadn't finished.

This is what I've used in bash:

cat file.txt | sort | uniq -c | sort -nr

Is there a more efficient way of doing it? Or is there a way I can see the progress, or would that just load my computer even more and make it even slower?
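
For what it's worth, the same pipeline can also be tuned without changing tools. A minimal sketch, assuming GNU sort and the optional pv utility for a progress bar; the buffer size and thread count here are only illustrative:

# pv shows read progress; LC_ALL=C forces fast byte-wise comparison;
# -S sets the sort buffer, --parallel the number of sort threads (both GNU sort)
pv file.txt | LC_ALL=C sort -S 4G --parallel=4 | uniq -c | sort -nr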

Upvotes: 3

Views: 8187

Answers (1)

James Brown

Reputation: 37404

If there are a lot of duplicates, i.e. if the unique lines would fit in your available memory, you could count the lines and sort the result using GNU awk:

$ awk '{
    a[$0]++                                # hash the lines and count
}
END {                                      # after counting the lines
    PROCINFO["sorted_in"]="@val_num_desc"  # traverse in descending numeric value order
    for(i in a)
        print a[i],i
}' file

Output for your sample data:

3 test1
2 test56
1 test34!
1 test2
1 test4
1 etc
1 test896&
1 ...

Related documentation: https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
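
Note that PROCINFO["sorted_in"] is specific to GNU awk. On a system without gawk, a minimal sketch of the same counting with the ordering handed off to an external sort (works in any POSIX awk):

$ awk '{ a[$0]++ } END { for (i in a) print a[i], i }' file | sort -rn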

Update: Since the unique lines didn't fit in memory (see comments), split the file on the first 0-2 characters of each line. The distribution across chunks will not be even:

$ awk '{
    ch=substr($0,match($0,/^.{0,2}/),RLENGTH)  # first 0-2 chars of the line
    if(!(ch in a))                             # if not found in hash
        a[ch]=++i                              # hash it and give a unique number
    filename=a[ch]".txt"                       # which is used as filename
    print >> filename                          # append to filename
    close(filename)                            # close so you won't run out of fds
}' file

Output with your test data:

$ ls -l ?.txt
-rw-rw-r-- 1 james james 61 May 13 14:18 1.txt
-rw-rw-r-- 1 james james  4 May 13 14:18 2.txt
-rw-rw-r-- 1 james james  4 May 13 14:18 3.txt
$ cat 3.txt
...

300 MB and 1.5 M lines took 50 seconds. If I removed the close() it only took 5 seconds, but you risk running out of file descriptors. I guess you could raise the limit (ulimit -n).
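
Because every copy of a given line ends up in the same chunk, each chunk can then be counted independently and the results concatenated. A minimal sketch of that second pass, assuming the numbered N.txt chunks produced above (counts.txt is just an illustrative output name):

$ for f in [0-9]*.txt; do
      # count within one chunk; identical lines never cross chunk boundaries
      awk '{ a[$0]++ } END { for (i in a) print a[i], i }' "$f"
  done | sort -rn > counts.txt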

Upvotes: 3
