palomo11

Reputation: 75

Merge two files based on partial matching

I have two files:

FileA.txt

ID
479432_Sros_4274
330214_NIDE2792
517722_CJLT1_010100003977
257310_BB0482
...

FileB.txt (the ** markers are only there to help you identify the matches)

members   category
6085.XP_002168109,**479432_Sros_4274**,4956.XP_002495993.1,457425.SSHG_03214,51511.ENSCSAVP000  P
7159.AAEL006372-PA,**257310_BB0482** J
**517722_CJLT1_010100003977**,701176.VIBRN418_17773,9785.ENSLAFP00000010769,28377.ENSACAP00000014901,4081.Solyc03g120250.2.1,3847.GLYMA18G02240.1 U
500485.XP_002561312.1,1042876.PPS_0730,222929.XP_003071446.1,**330214_NIDE2792**  S
...

Expected output

Output.txt

ID  category
479432_Sros_4274  P
330214_NIDE2792  S
517722_CJLT1_010100003977  U
257310_BB0482  J
...

I have tried some code in awk and R based on answers to other questions, but I could not get the desired output.

Upvotes: 1

Views: 609

Answers (2)

RavinderSingh13

Reputation: 133438

Could you please try the following?

awk '
BEGIN{
  print "ID  category"
}
FNR==NR{
  a[$0]
  next
}
{
  for(i in a){
    if(match($0,i)){
      print i,$NF
    }
  }
}
'  Input_filea   Input_fileb

Explanation: a commented version of the code above.

awk '                               ## Start of the awk program.
BEGIN{                              ## BEGIN block: runs before any input is read.
  print "ID  category"              ## Print the output header.
}                                   ## End of the BEGIN block.
FNR==NR{                            ## True only while the first file (Input_filea) is being read.
  a[$0]                             ## Store the whole line ($0) as an index of array a.
  next                              ## Skip the remaining rules for this line.
}
{                                   ## Runs for every line of the second file (Input_fileb).
  for(i in a){                      ## Loop over every stored index of a.
    if(match($0,i)){                ## match() treats i as a regex and searches the current line for it.
      print i,$NF                   ## Print the matched ID and the last field (the category).
    }
  }
}
'  Input_filea   Input_fileb        ## The input files, in this order.
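
A possible variation on the loop above, offered as a sketch rather than as part of the original answer. Two things are worth noting about the code as posted: match() interprets each stored line as a regular expression, and the header line ID of the first file is stored like any other line, so it can match inside a member such as 330214_NIDE2792. The version below assumes the first line of each file is a header and that a literal substring test is what is wanted, so it uses index() instead of match(). The sample data is a trimmed, hypothetical copy of the question's files without the ** markers, and the array name ids is just illustrative.

```shell
# Hypothetical sample data, trimmed from the question (no ** markers).
dir=$(mktemp -d)
printf '%s\n' 'ID' '479432_Sros_4274' '330214_NIDE2792' '257310_BB0482' > "$dir/FileA.txt"
printf '%s\n' 'members   category' \
  '6085.XP_002168109,479432_Sros_4274,4956.XP_002495993.1  P' \
  '7159.AAEL006372-PA,257310_BB0482 J' \
  '500485.XP_002561312.1,330214_NIDE2792  S' > "$dir/FileB.txt"

out=$(awk '
FNR==NR { if (FNR > 1) ids[$0]; next }   # FileA: store every ID except the header line
FNR==1  { print "ID", "category"; next } # FileB header: print the output header instead
{
    for (id in ids)                      # literal substring test, no regex involved
        if (index($0, id)) print id, $NF
}' "$dir/FileA.txt" "$dir/FileB.txt")

printf '%s\n' "$out"
rm -r "$dir"
```

The output follows the order of the second file, as in the original loop. Every line of the second file is still compared against every stored ID, so the cost grows with the product of the two file sizes.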

Upvotes: 3

James Brown

Reputation: 37394

This is one way of doing it:

$ awk '
NR==FNR {                  # process file1
    if(FNR==1)             # print header, no newline
        printf "%s", $1
    a[$1]                  # hash data
    next
}
{                          # process file2
    if(FNR==1)             # print the other half of the header
        print OFS $2
    for(i in a)            # loop all items in hash
        if($1 ~ i)         # check for partial match
            print i,$2     # if found, output
}' file1 file2             # mind the order

Output (in file2 order; notice the partial match of ID in the last line of output, left as a warning):

ID category
479432_Sros_4274 P
257310_BB0482 J
517722_CJLT1_010100003977 U
330214_NIDE2792 S
ID S
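
One way to avoid the stray ID S line, sketched here under an assumption the question does not state outright: that every ID in FileA.txt appears in FileB.txt as a complete comma-separated member, so an exact lookup can replace the partial match. The sketch reads FileB.txt first, splits the member list on commas into a lookup table, and then walks FileA.txt, which also yields the rows in the order of the question's expected output. The array names member and category and the trimmed sample data are illustrative only.

```shell
# Hypothetical sample data, trimmed from the question (no ** markers).
dir=$(mktemp -d)
printf '%s\n' 'ID' '479432_Sros_4274' '330214_NIDE2792' '257310_BB0482' > "$dir/FileA.txt"
printf '%s\n' 'members   category' \
  '6085.XP_002168109,479432_Sros_4274,4956.XP_002495993.1  P' \
  '7159.AAEL006372-PA,257310_BB0482 J' \
  '500485.XP_002561312.1,330214_NIDE2792  S' > "$dir/FileB.txt"

out=$(awk '
NR==FNR {                                 # first file on the command line: FileB
    if (FNR == 1) next                    # skip the "members category" header
    n = split($1, member, ",")            # break the member list on commas
    for (j = 1; j <= n; j++)
        category[member[j]] = $NF         # member -> category lookup table
    next
}
FNR==1 { print $1, "category"; next }     # FileA header becomes the output header
$1 in category { print $1, category[$1] } # whole-member match, in FileA order
' "$dir/FileB.txt" "$dir/FileA.txt")

printf '%s\n' "$out"
rm -r "$dir"
```

Note that the file order on the command line is reversed compared with the answer above. Because each FileA.txt line costs a single hash lookup, this avoids comparing every ID against every line, but it only works if the IDs really are whole members rather than fragments of them.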

Upvotes: 4
