temp0706

Reputation: 49

Multiple Big file sort

I have two files in which each line is ordered by timestamp, but the two files have different structures. I want to merge these files into one single file ordered by timestamp. They look like:

file A (less than 2 GB)

1,1,1487779199850
2,2,1487779199852
3,3,1487779199854
4,4,1487779199856
5,5,1487779199858

file B (less than 15 GB)

1,1,10,100,1487779199850
2,2,20,200,1487779199852
3,3,30,300,1487779199854
4,4,40,400,1487779199856
5,5,50,500,1487779199858

How can I accomplish this? Is there any way to make it as fast as possible?

Upvotes: 1

Views: 112

Answers (2)

Ed Morton

Reputation: 204124

$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | sort -s -n -k1,1 | cut -f2-
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858

I originally posted the above as just a comment under @VM17's answer but (s)he suggested I make it a new answer.

The above would be more robust and efficient. It uses the default separator for sort and cut (a tab), and it truly sorts on only the first key, whereas a plain -k1 makes the key run from field 1 to the end of the line, so the other answer effectively sorts on the whole line. It uses a stable sort (sort -s) to preserve the input order of records with equal timestamps, and it uses cut rather than a second awk invocation to strip off the added key field, which is more efficient since awk does field splitting etc. on each record, none of which is needed just to remove the leading field.
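
For reference, the intermediate lines that sort operates on here are just the timestamp key, a tab, and the original record; for the sample data the first few should look like:

$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | head -3
1487779199850	1,1,1487779199850
1487779199852	2,2,1487779199852
1487779199854	3,3,1487779199854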

Alternatively, you might find something like this more efficient:

$ cat tst.awk
BEGIN { FS = "," }                          # split on commas so $NF is the timestamp field
{ currRec = $0; currKey = $NF }             # remember the current fileA record and its timestamp
NR>1 {
    print prevRec                           # emit the previous fileA record
    printf "%s", saved                      # emit the fileB record held over from the last pass, if any
    # emit fileB records until one catches up with the current fileA timestamp
    while ( (getline < "fileB") > 0 ) {
        if ($NF < currKey) {
            print
        }
        else {
            saved = $0 ORS                  # too new - hold it for the next pass
            break
        }
    }
}
{ prevRec = currRec; prevKey = currKey }
END {
    print prevRec                           # emit the final fileA record
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {     # drain whatever is left of fileB
        print
    }
}

$ awk -f tst.awk fileA
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858

As you can see, it reads from fileB between reads of lines from fileA, comparing timestamps, so it interleaves the two files and doesn't require a subsequent pipe to sort and cut.

Just check the logic as I didn't think about it very much, and be aware that this is one of the rare situations where getline might be appropriate for efficiency. Make sure to read http://awk.freeshell.org/AllAboutGetline to understand all of its caveats if you're ever considering using it again.
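
If you'd rather avoid getline entirely but still exploit the fact that both inputs are already sorted, another option is to decorate each file separately and let sort -m merge them in a single pass instead of doing a full sort. This is just an untested sketch; the .keyed temp file names are placeholders:

$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA > fileA.keyed
$ awk -F, -v OFS='\t' '{print $NF, $0}' fileB > fileB.keyed
$ sort -m -s -n -k1,1 fileA.keyed fileB.keyed | cut -f2- > merged

Since -m only merges already-sorted inputs, it reads each file once and avoids the temporary-file spill that a full sort of a ~15G file would otherwise need.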

Upvotes: 1

Chem-man17

Reputation: 1770

Try this-

awk -F, '{print $NF, $0}' fileA fileB | sort -nk 1 | awk '{print $2}'

Output-

1,1,10,100,1487779199850
1,1,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858

This concatenates the two files, puts the timestamp at the start of each line, sorts on that timestamp, and then removes the dummy column.

This will be slow for big files though.
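
That said, if you do run a full sort on files this size, GNU sort can be given more memory, a scratch directory, and extra threads, which helps considerably. A sketch, assuming GNU coreutils; the 4G buffer size, /tmp scratch directory, and thread count are illustrative values:

$ awk -F, '{print $NF, $0}' fileA fileB | sort -S 4G -T /tmp --parallel=4 -nk 1 | awk '{print $2}'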

Upvotes: 0
