Jiao
Jiao

Reputation: 219

awk search content if it contains content in a list file

I am meeting some difficulties when using AWK to search a huge csv file (this can be called file1). fortunately, I have a list file (this can be called file2). I can just search the rows that I need basing on the index list file in file2. however, the file1 is not like any other normal file, it's like:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID3, P01723;F08734;,ID3_name
ID4, AC0014;AC0114;P01112;,ID4_name
...
IDn, AC0006;,IDn_name
IDm, Ac8007; P01167;,IDm_name

the index file2 like:

AC000112
AC000801
P01112
P01167

the desired output should be:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

if I use

awk -F, 'NR==FNR{a[$1]; next} ($2 in a)' file2 file1

I will get nothing, if I add ";" at the end of each line in file2, I will only get ID2, AC000801;,ID2_name. and if I change $2 ~ a[$1], it still didn't work.

So, I was wondering how to change this command to get the desired result. Thanks!

Upvotes: 1

Views: 164

Answers (4)

dan
dan

Reputation: 5241

If you aren't restricted to awk, I would use grep for this task:

grep -Fwf file2 file1
  • -f file2: use each line of file2 as a search string.

  • -w: match whole words only (such that pattern P01167 won't match P011670). Any character other than letters, digits and underscore delimit a word (so P01167;, will match).

  • -F: fixed strings - match the string exactly, such that any regex characters have no special meaning.

Upvotes: 1

markp-fuso
markp-fuso

Reputation: 34966

Assumptions:

  • the search strings only consist of characters and numbers

One GNU awk idea where we append word boundary flags to our search patterns and then perform regex comparisons:

awk -F',' '
FNR==NR { regs["\\<" $1 "\\>"]; next }
        { for (regex in regs)
              if ($2 ~ regex) { print; next }
        }
' file2 file1

This generates:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

Upvotes: 2

RavinderSingh13
RavinderSingh13

Reputation: 133610

With your shown samples, please try following awk code.

awk -F',|[[:space:]]+|;' '
FNR==NR{
  for(i=2;i<=NF;i++){
    arr[$i]=$0
  }
  next
}
($0 in arr){
  print arr[$0]
}
' file1 file2

Explanation: Adding detailed explanation for above code.

awk -F',|[[:space:]]+|;' '  ##Setting field separator as comma, space(s), semi-colon here.
FNR==NR{                    ##This condition will be TRUE when file1 is being read.
  for(i=2;i<=NF;i++){       ##Using for loop to traverse from 2nd field to till last field.
    arr[$i]=$0              ##Creating arr with index of current field, with value of current line.
  }
  next                      ##next will skip all further lines from here.
}
($0 in arr){                ##Checking condition if current line is present in arr.
  print arr[$0]             ##Printing arr with index of $0 here.
}
' file1 file2               ##Mentioning Input_file names here.

Upvotes: 3

The fourth bird
The fourth bird

Reputation: 163467

You could set the field separator to a comma followed by optional spaces [[:space:]]*,[[:space:]]*

Then you can split the second field of file 1 on a semicolon and optional spaces [[:space:]]*;[[:space:]]* and check if one of those is present in a

awk -F"[[:space:]]*,[[:space:]]*" 'NR==FNR{
  a[$1]; next
}
{
  split($2, parts, /[[:space:]]*;[[:space:]]*/)
  for (i in parts) {
    if (parts[i] in a) {
      print $0; break;
    }
  } 
}
' file2 file1

Output

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

Upvotes: 3

Related Questions