Shreyash
Shreyash

Reputation: 456

Combine the data of two files having same column using awk

I have two files: file1 and file2. i want to combine the time between file1 and file2 with common column as 1 and 2. I need to create a separate file which contains the file1 data and file2 mapping time column. But for some entires(see BOLD TEXT) its matching the same line from file2. Whats wrong in my command? is there any way to get the expected output?

Note: there can be more than 2 entries as shown in the bold text. column1 and column2 entries can be same while only time changes.

file 1

9950 | p | 2021-02-27 14:16:21.905
9951 | q | 2021-02-27 14:16:28.845
9954 | r | 2021-02-27 14:17:21.884
9954 | r | 2021-02-27 14:19:46.775
9956 | y | 2021-02-27 14:20:20.830
9957 | z | 2021-02-27 14:20:27.325

file 2

9950 | p | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.956
9956 | y | 2021-02-27 14:20:21.018
9957 | z | 2021-02-27 14:20:27.512

command:

 awk -F'|' 'NR==FNR{a[$1,$2]=$3;next} a[$1,$2]{print $0"\t"a[$1,$2]}' file2 file1

Output:

9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:19:46.956
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
9956 | x | 2021-02-27 14:20:20.830 | 2021-02-27 14:20:21.018
9957 | y | 2021-02-27 14:20:27.325 | 2021-02-27 14:20:27.512

Expected output:

9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
9956 | x | 2021-02-27 14:20:20.830 | 2021-02-27 14:20:21.018
9957 | y | 2021-02-27 14:20:27.325 | 2021-02-27 14:20:27.512

Upvotes: 1

Views: 310

Answers (3)

Ed Morton
Ed Morton

Reputation: 203522

If this isn't all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for:

$ paste file1 file2 | awk '{$7=$8=$9=""; $0=$0; $1=$1}1'
9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
9956 | y | 2021-02-27 14:20:20.830 | 2021-02-27 14:20:21.018
9957 | z | 2021-02-27 14:20:27.325 | 2021-02-27 14:20:27.512

In the above $7=$8=$9="" set those fields to NULL strings but does not change how many fields are in $0, then $0=$0 causes awk to resplit $0 thereby removing those as individual fields but not altering the white space around where they existed, then $1=$1 causes awk to reconstruct $0 from its fields replacing every string that matches FS (i.e. contiguous white space) with the string that matches OFS (i.e. a single blank char). For example (with | tr ' ' '-' added to make the spaces more clearly visible):

$ echo 'a b c' | awk '{print NF, $0}' | tr ' ' '-'
3-a-b-c
$ echo 'a b c' | awk '{$2=""; print NF, $0}' | tr ' ' '-'
3-a--c
$ echo 'a b c' | awk '{$2=""; $0=$0; print NF, $0}' | tr ' ' '-'
2-a--c
$ echo 'a b c' | awk '{$2=""; $0=$0; $1=$1; print NF, $0}' | tr ' ' '-'
2-a-c

Upvotes: 3

James Brown
James Brown

Reputation: 37404

$ awk '
BEGIN {
    FS=" [|] "
    OFS=" | "
}
NR==FNR {
    a[$1,++c[$1]]=$0          # c[$1] is an instance counter array, 
    next                      # unique for each $1
}
{
    print a[$1,++d[$1]],$NF   # so is d[$1]
}' file1 file2

Output:

9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
...

Updated with request from comments:

$ awk '
BEGIN {
    FS=" [|] "
    OFS=" | "
}
NR==FNR {
    a[$1,++c[$1]]=$0          # c[$1] is an instance counter array, 
    next                      # unique for each $1
}
(($1,++d[$1]) in a) {         # only output if match in both files
    print a[$1,d[$1]],$NF     # so is d[$1]
}' file1 file2

Upvotes: 3

glenn jackman
glenn jackman

Reputation: 246807

Similar to James Brown's answer, but only needing to keep one "count" array, and using $1 plus $2 for the key

gawk -F '|' '
    FNR == 1 {delete count}
    {key = $1 SUBSEP $2 SUBSEP (++count[$1,$2])}
    FILENAME == ARGV[1] {time[key] = $3; next}
    key in time {print $0, FS, time[key]}
' file2 file1

Upvotes: 3

Related Questions