temp0706

Reputation: 49

Multiple Big file sort

I have two files in which each line is ordered by timestamp, but the two files have different structures. I want to merge these files into one single file ordered by timestamp. They look like:

file A (less than 2 GB)

1,1,1487779199850
2,2,1487779199852
3,3,1487779199854
4,4,1487779199856
5,5,1487779199858

file B (less than 15 GB)

1,1,10,100,1487779199850
2,2,20,200,1487779199852
3,3,30,300,1487779199854
4,4,40,400,1487779199856
5,5,50,500,1487779199858

How can I accomplish this? Is there any way to make it as fast as possible?

Upvotes: 1

Views: 112

Answers (2)

Ed Morton

Reputation: 204124

$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | sort -s -n -k1,1 | cut -f2-
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858

I originally posted the above as just a comment under @VM17's answer but (s)he suggested I make it a new answer.

The above would be more robust and efficient. It uses the default separator for sort and cut (a tab), and it truly sorts on only the first key, whereas a plain -k1 makes the key run from field 1 to the end of the line, so the other answer effectively sorts on the whole line. It uses a stable sort (sort -s) to preserve the input order of records with equal timestamps, and it uses cut rather than a second awk invocation to strip off the added key field, which is more efficient since awk does field splitting etc. on each record, none of which is needed just to remove the leading field.
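
For reference, the intermediate lines that sort operates on here are just the timestamp key, a tab, and the original record; for the sample data the first few should look like:

$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | head -3
1487779199850	1,1,1487779199850
1487779199852	2,2,1487779199852
1487779199854	3,3,1487779199854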

Alternatively, you might find something like this more efficient:

$ cat tst.awk
BEGIN { FS = "," }                          # split on commas so $NF is the timestamp field
{ currRec = $0; currKey = $NF }             # remember the current fileA record and its timestamp
NR>1 {
    print prevRec                           # emit the previous fileA record
    printf "%s", saved                      # emit the fileB record held over from the last pass, if any
    # emit fileB records until one catches up with the current fileA timestamp
    while ( (getline < "fileB") > 0 ) {
        if ($NF < currKey) {
            print
        }
        else {
            saved = $0 ORS                  # too new - hold it for the next pass
            break
        }
    }
}
{ prevRec = currRec; prevKey = currKey }
END {
    print prevRec                           # emit the final fileA record
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {     # drain whatever is left of fileB
        print
    }
}

$ awk -f tst.awk fileA
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858

As you can see, it reads from fileB between reads of lines from fileA, comparing timestamps, so it interleaves the two files and doesn't require a subsequent pipe to sort and cut.

Just check the logic as I didn't think about it very much, and be aware that this is one of the rare situations where getline might be appropriate for efficiency. Make sure to read http://awk.freeshell.org/AllAboutGetline to understand all of its caveats if you're ever considering using it again.
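
If you'd rather avoid getline entirely but still exploit the fact that both inputs are already sorted, another option is to decorate each file separately and let sort -m merge them in a single pass instead of doing a full sort. This is just an untested sketch; the .keyed temp file names are placeholders:

$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA > fileA.keyed
$ awk -F, -v OFS='\t' '{print $NF, $0}' fileB > fileB.keyed
$ sort -m -s -n -k1,1 fileA.keyed fileB.keyed | cut -f2- > merged

Since -m only merges already-sorted inputs, it reads each file once and avoids the temporary-file spill that a full sort of a ~15G file would otherwise need.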

Upvotes: 1

Chem-man17

Reputation: 1770

Try this-

awk -F, '{print $NF, $0}' fileA fileB | sort -nk 1 | awk '{print $2}'

Output-

1,1,10,100,1487779199850
1,1,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858

This concatenates the two files, puts the timestamp at the start of each line, sorts on that timestamp, and then removes the dummy column.

This will be slow for big files though.
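
That said, if you do run a full sort on files this size, GNU sort can be given more memory, a scratch directory, and extra threads, which helps considerably. A sketch, assuming GNU coreutils; the 4G buffer size, /tmp scratch directory, and thread count are illustrative values:

$ awk -F, '{print $NF, $0}' fileA fileB | sort -S 4G -T /tmp --parallel=4 -nk 1 | awk '{print $2}'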

Upvotes: 0
