Ger Cas

Reputation: 2298

Print strings of file1 that appear in file2 in awk

I want to print the programming languages in file1 that appear in file2, along with the corresponding line number in file2 and the complete line of file2.

file1 is like this:

Ruby
Visual Basic
Objective-C
C
R
C++
Basic

file2 is like this:

5. ab cde fg Java hij kl
2. ab PHP dddf llf 
4. cde fg z o Objective-C oode
8. a12b cde JavaScript kdk
6. ab99r cde Visual Basic llso dkd
1. lkd dsk Ruby kksdk
3. Python dsdls
9. CSS dkdsk
4. Jdjdj C Jjd Kkd
12. Iiii Jjd R Hhd
5. Jjjff C++ jdjejd
7. Jfjfjdoo Uueye Basic Jje Tasdk

I'd like to get this output:

6|Ruby|1. lkd dsk Ruby kksdk
5|Visual Basic|6. ab99r cde Visual Basic llso dkd
3|Objective-C|4. cde fg z o Objective-C oode
9|C|4. Jdjdj C Jjd Kkd
10|R|12. Iiii Jjd R Hhd
11|C++|5. Jjjff C++ jdjejd
12|Basic|7. Jfjfjdoo Uueye Basic Jje Tasdk

where 6, 5 and 3 are the line numbers in file2 where "Ruby", "Visual Basic" and "Objective-C" appear.

So far I've tried the code below, but it only works if each line of file2 is an exact match for a line of file1.

awk 'NR == FNR{a[$0];next} ($0 in a)' file1 file2

In this case the programming languages in file2 have some text before and after them, and I'm stuck on how to get the output I want.

Thanks in advance for any help.

Upvotes: 1

Views: 118

Answers (2)

Ed Morton

Reputation: 203899

With GNU awk for sorted_in to search for the longest languages (e.g. Visual Basic) first and remove those from the current line as they're found so the shorter languages that are part of them (e.g. Basic) can't be found within them:

$ cat tst.awk
BEGIN { OFS="|" }
NR==FNR {
    lengths[$0] = length($0)
    next
}
{
    line = " " $0 " "
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (lang in lengths) {
        if ( s = index(line," "lang" ") ) {
            print FNR, lang, $0
            line = substr(line,1,s) substr(line,s+1+lengths[lang])
        }
    }
}

$ awk -f tst.awk file1 file2
3|Objective-C|4. cde fg z o Objective-C oode
5|Visual Basic|6. ab99r cde Visual Basic llso dkd
6|Ruby|1. lkd dsk Ruby kksdk

$ cat file1
Ruby
Visual Basic
Objective-C
C
C++
Basic
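For anyone without GNU awk, the same longest-first idea can be sketched in POSIX awk by reading the languages into a numbered array and sorting it by descending length before file2 is processed. This is only a minimal sketch, not a drop-in replacement: the function name match_longest_first is made up for illustration, and the insertion sort assumes file1 is small.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): match_longest_first LANG_FILE TEXT_FILE
# Assumes LANG_FILE is non-empty and small enough for an insertion sort.
match_longest_first() {
    awk -v OFS='|' '
    NR == FNR {                          # first file: collect the languages
        if ($0 != "") langs[++n] = $0
        next
    }
    FNR == 1 && !sorted {                # reached the second file: sort once
        for (i = 2; i <= n; i++) {       # insertion sort, longest language first
            tmp = langs[i]
            for (j = i - 1; j >= 1 && length(langs[j]) < length(tmp); j--)
                langs[j + 1] = langs[j]
            langs[j + 1] = tmp
        }
        sorted = 1
    }
    {
        line = " " $0 " "                # pad so matches are space-delimited
        for (i = 1; i <= n; i++) {
            lang = langs[i]
            if ((s = index(line, " " lang " ")) > 0) {
                print FNR, lang, $0
                # cut the match out so shorter languages cannot match inside it
                line = substr(line, 1, s) substr(line, s + 1 + length(lang))
            }
        }
    }' "$1" "$2"
}
```

It would be called as match_longest_first file1 file2. Output comes out in file2 line order, and within one line the longer languages are reported first.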

Upvotes: 1

RavinderSingh13

Reputation: 133600

Could you please try the following (changed the index usage in the code as per @Ed Morton sir's suggestion).

awk -v OFS='|' '
FNR==NR{
  a[$0]
  next
}
{
  for(i in a){
     if(index(" "$0" "," "i" ")){
         print FNR,i,$0
     }
  }
}
'  Input_file1  Input_file2 | sort -t'|' -nr

Output will be as follows.

6|Ruby|1. lkd dsk Ruby kksdk
5|Visual Basic|6. ab99r cde Visual Basic llso dkd
3|Objective-C|4. cde fg z o Objective-C oode
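The space padding around both arguments of index() is what keeps a short name such as C from matching inside CSS or Objective-C. A small sketch to try that check on its own (contains_word is a made-up helper name, and it assumes the strings contain no backslashes, since awk -v interprets escape sequences):

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): contains_word WORD LINE
# Prints "yes" if WORD occurs in LINE as a space-delimited string, else "no".
contains_word() {
    awk -v word="$1" -v line="$2" 'BEGIN {
        # index() does a literal (non-regex) search, so "C++" needs no escaping
        print (index(" " line " ", " " word " ") ? "yes" : "no")
    }'
}

contains_word "C" "9. CSS dkdsk"          # prints "no"
contains_word "C" "4. Jdjdj C Jjd Kkd"    # prints "yes"
```

Note that the padding alone does not stop Basic from matching on a line that contains Visual Basic, because Basic is a complete space-delimited word there.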

Explanation: Adding a commented version of the above code now.

awk -v OFS='|' '                           ##Starting awk program here and setting OFS to "|".
FNR==NR{                                   ##Checking condition FNR==NR which will be TRUE when first Input_file is being read.
  a[$0]                                    ##Creating an array named a whose index is $0 (no value is assigned).
  next                                     ##Skipping all further statements for this line.
}
{                                          ##Starting block here, it runs for lines of the second Input_file.
  for(i in a){                             ##Looping over every index (language) of array a.
     if(index(" "$0" "," "i" ")){          ##Checking if language i is present as a space-delimited string in current line.
         print FNR,i,$0                    ##If above is TRUE then print FNR"|"i"|"$0 as per OP need.
     }
  }
}
'  file1  file2 | sort -t'|' -nr           ##Mentioning Input_file names here and passing the output into sort, which sorts it numerically in reverse order.
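The reverse numeric sort matches the three-line sample, but the output requested in the question follows the order of file1. If that order is what is wanted, one possible sketch is to buffer file2 and loop over the languages in file1 order in an END block. The function name match_in_file1_order is made up for illustration, and this assumes file2 fits in memory.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): match_in_file1_order LANG_FILE TEXT_FILE
# Assumes TEXT_FILE is small enough to hold in memory.
match_in_file1_order() {
    awk -v OFS='|' '
    NR == FNR {                              # first file: keep languages in order
        if ($0 != "") langs[++n] = $0
        next
    }
    { lines[FNR] = $0; last = FNR }          # second file: buffer every line
    END {
        for (i = 1; i <= n; i++)             # languages in file1 order
            for (l = 1; l <= last; l++)      # then lines in file2 order
                if (index(" " lines[l] " ", " " langs[i] " "))
                    print l, langs[i], lines[l]
    }' "$1" "$2"
}
```

As with the code above, Basic will still be reported for a line containing Visual Basic; removing longer matches first would be needed to avoid that.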

Upvotes: 3
