gabriel32
gabriel32

Reputation: 1

Merge two files under a certain condition

File 1 (big file):

A B C TIMESTAMP  
4 4 4 12344653380  
5 5 5 12344653401  
6 6 6 12344653411  
7 7 7 12344653433  
8 8 8 12344653460  
9 9 9 12344653466 

File 2 (small file)(samples from every 30 sec):

D E F TIMESTAMP  
1 1 1 12344653400  
2 2 2 12344653430  
3 3 3 12344653460  

File 3:

merge every line where timestamp is between the interval of TIMESTAMP on file 2

A B C TIMESTAMP   D E F   
5 5 5 12344653401 1 1 1  
6 6 6 12344653411 1 1 1  
7 7 7 12344653433 2 2 2  
8 8 8 12344653460 3 3 3  

I have found lots of answers like these How to merge two files using AWK? but they always focus on a match.

Upvotes: 0

Views: 375

Answers (2)

RavinderSingh13
RavinderSingh13

Reputation: 133518

Could you please try following and let me know if this helps you.

awk 'FNR==NR && FNR>1{MIN=MIN>$NF?NF:(MIN?MIN:$NF);MAX=MAX>$NF?MAX:$NF;next} ($NF>=MIN && $NF<=MAX)' file2 file1

You could take above's output into file3.


Explanation of awk command above:

awk 'FNR==NR && FNR>1{

FNR and NR are 2 awk's built-in variables which will let us know the number of lines for any file, so the NR variable's value will be increasing until the last file being read. As we know, awk can read multiple files at a time, but unlike NR, FNR's value will be reset at the beginning of each new file being read. I give the condition FNR==NR which means this condition will only be TRUE when first file is being read. Additionally, I am using FNR>1 to make sure file 2's first line shouldn't be read -- as it has headers in it instead of timings. So, when both of these are TRUE the next actions should take place.

MIN=MIN>$NF?NF:(MIN?MIN:$NF);

Create a variable named MIN which has a condition if MIN's value is greater than $NF (which is last field in awk, where $ refers to the field's value and NF refers to the number of fields). If this condition is TRUE then the action after wild character ? will be done, which is changing MIN's value to $NF's. If that condition is NOT TRUE then the action/condition after : will be executed. So, here again there is one condition which checks if MIN is NULL then keeps it's value to $NF else keep MIN value as it is.

MAX=MAX>$NF?MAX:$NF;

Creating a variable named MAX, and add a condition to check if MAX's value is greater than $NF's value, if this is TRUE then action after ?, which is to keep MAX's value as it is, will be executed. If FALSE then the action after :, which is to change it's value to $NF, will be executed.

next}

Using awk's built in keyword next here which means leave all next statements further so awk's cursor will move to the beginning of the statements.

($NF>=MIN && $NF<=MAX)

Now we need to check a condition (which will be only executed when first file named file2 is being read completely), as described above the next statement, this will not allow it to execute this condition. Check here if $NF's value is greater or equal to variable MIN's value and less or equal to MAX's value then print the current line, though there is nothing written to print here. awk works on condition/action, so if a condition is TRUE, then certain actions should be placed. No action is described here, so the default action of printing the current line will happen.

' file2 file1

Mentioning the Input_files named file2 and file1 here.

Upvotes: 3

codeforester
codeforester

Reputation: 42999

Try this:

  1. read file 2 into an array (assuming the file has data sorted by timestamp, if not sort the array)
  2. read file 1 (pre-sort it by timestamp if required) line by line; for each line, do binary search on the array to find the row that falls in the interval, merge the found line with the line from file 1

Upvotes: 0

Related Questions