aheywood

Reputation: 21

Match strings in two files using awk and regexp

I have two files.

File 1 includes various types of SeriesDescriptions

"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
...

File 2 contains information with only one SeriesDescription

"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
...

I want to

  1. compare the two files and find the lines that match for "SeriesDescription" and
  2. print the line number of the matched text from File 1.

Expected Output:

"SeriesDescription": "Type_*" 24 11 (the correct line numbers in my files)

"SeriesDescription" will always be found on line 11 of File 2. I am having trouble matching given the * and have also tried changing it to .* without luck.

Code I have tried:

grep -nf File1.txt File2.txt

This matches successfully, but it reports the line number from File 2, and I want the line number from File 1.

awk 'FNR==NR{l[$1]=NR; next}; $1 in l{print $0, l[$1], FNR}' File2.txt File1.txt

This finds a match and prints the line numbers from both files. However, it matches on the first column only, so it prints the last line of File 1 as the match (every line in File 1 has the same first column).

awk 'FNR==NR{l[$2]=$3;l[$2]=NR; next}; $2 in l{print $0, l[$2], FNR}' File2.txt File1.txt

This does not produce a match.

I have also tried various settings of FS=":" without luck. I am not sure whether the trouble comes from the regex, from the double quotes in the files, or from something else. Any help would be greatly appreciated!

Upvotes: 1

Views: 523

Answers (1)

RavinderSingh13

Reputation: 133750

With your shown samples, please try the following. It was written and tested in GNU awk and should work in any awk.

awk '
{ val="" }
match($0,/^[^_]*_/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
  if(val){
    arr[val]=$0 OFS FNR
  }
  next
}
(val in arr){
  print arr[val] OFS FNR
}
' SeriesDescriptions file2

With your shown samples, the output will be:

"SeriesDescription": "Type_*" 1 3

Explanation: a detailed, line-by-line explanation of the code above.

awk '                            ##Start the awk program.
{ val="" }                       ##Reset val for every input line.
match($0,/^[^_]*_/){             ##Match from the start of the line through the first _.
  val=substr($0,RSTART,RLENGTH)  ##Store the matched text in val.
  gsub(/[[:space:]]+/,"",val)    ##Remove all whitespace from val.
}
FNR==NR{                         ##True only while the first file is being read.
  if(val){                       ##If val is not empty.
    arr[val]=$0 OFS FNR          ##Index arr by val; store the current line, OFS and its line number (FNR).
  }
  next                           ##Skip the remaining rules for lines of the first file.
}
(val in arr){                    ##For the second file: if val is an index of arr.
  print arr[val] OFS FNR         ##Print the stored value, OFS and the current line number.
}
' SeriesDescriptions file2       ##Input file names.
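
To see why lines from the two files end up with the same array index, the key-building step can be run on its own. This is a minimal sketch that feeds one line from each of the shown samples through the same match, substr and gsub calls; the shell variable name keys is only an illustration.

```shell
# One sample line from File 1 and one from File 2, copied from the question.
keys=$(printf '%s\n' \
  '"SeriesDescription": "Type_*"' \
  '"SeriesDescription":"Type_(Joe_text)"' |
awk '
match($0, /^[^_]*_/) {
  val = substr($0, RSTART, RLENGTH)   # text from line start through the first "_"
  gsub(/[[:space:]]+/, "", val)       # drop whitespace so both spellings agree
  print val
}')
printf '%s\n' "$keys"
```

Both lines print "SeriesDescription":"Type_ as their key, which is why (val in arr) succeeds for line 3 of file2 in the shown samples.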

Bonus solution: If the above works for you and the match occurs only once in your file2, you can exit as soon as it is found to make the program faster. In that case, write the code as follows.

awk '
{ val="" }
match($0,/^[^_]*_/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
  if(val){
    arr[val]=$0 OFS FNR
  }
  next
}
(val in arr){
  print arr[val] OFS FNR
  exit
}
' SeriesDescriptions file2
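
A different approach, not the one used above, is to treat each line of File 1 as a wildcard pattern in which * stands for any run of characters, which is how the question describes the data. The following is only a sketch under assumptions: spaces and tabs are ignored when comparing, * is the only wildcard, and the function name glob_match, the variable names and the temporary file names are all hypothetical. It compares literal pieces with index() and substr() instead of building a regex, so characters such as ( and ) in the data need no escaping.

```shell
# Recreate the shown samples in a temporary directory (file names are assumptions).
tmp=$(mktemp -d)
printf '%s\n' '"SeriesDescription": "Type_*"' '"SeriesDescription": "OtherType_*"' > "$tmp/File1.txt"
printf '%s\n' '"Name":"Joe"' '"Age":"18"' '"SeriesDescription":"Type_(Joe_text)"' > "$tmp/File2.txt"

out=$(awk '
# Return 1 if string s matches pat, where each * in pat matches any run of characters.
function glob_match(s, pat,    parts, n, i, k, pos, last) {
  n = split(pat, parts, "[*]")            # literal pieces between the wildcards
  if (n == 1) return (s == pat)           # no wildcard: require an exact match
  if (substr(s, 1, length(parts[1])) != parts[1]) return 0   # first piece is a prefix
  pos = length(parts[1]) + 1              # next unmatched position in s
  for (i = 2; i < n; i++) {               # middle pieces must appear in order
    if (parts[i] == "") continue
    k = index(substr(s, pos), parts[i])
    if (k == 0) return 0
    pos += k - 1 + length(parts[i])
  }
  last = parts[n]                         # last piece must be a suffix
  if (length(s) - length(last) + 1 < pos) return 0
  return (substr(s, length(s) - length(last) + 1) == last)
}
FNR == NR {                               # first file: remember every pattern
  pat = $0
  gsub(/[ \t]+/, "", pat)                 # ignore spacing differences
  if (pat != "") { n_pat++; pats[n_pat] = pat; text[n_pat] = $0; line_no[n_pat] = FNR }
  next
}
{                                         # second file: test each line against all patterns
  s = $0
  gsub(/[ \t]+/, "", s)
  for (i = 1; i <= n_pat; i++)
    if (glob_match(s, pats[i])) print text[i], line_no[i], FNR
}
' "$tmp/File1.txt" "$tmp/File2.txt")
printf '%s\n' "$out"
rm -rf "$tmp"
```

With the shown samples this prints the File 1 line followed by its line number in File 1 and the matching line number in File 2. It loops over every pattern for each line of File 2, so it trades some speed for handling patterns whose literal part itself contains an underscore.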

Upvotes: 3
