John B

Reputation: 3646

Comparing split strings inside fields of two CSV files

I have a CSV file (file1) that looks something like this:

123,info,ONE NAME
124,info,ONE VARIATION
125,info,NAME ANOTHER
126,info,SOME TITLE

and another CSV file (file2) that looks like this:

1,info,NAME FIRST
2,info,TWO VARIATION
3,info,NAME SECOND
4,info,ANOTHER TITLE

My desired output would be:

1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER

Where, if the first word in comma-delimited field 3 of file2 (i.e. NAME in line 1) is equal to any of the words in field 3 of file1, print a line in the format:

field1(file2),field1(file1),field3(file2),field3(file1)

Each file has the same number of lines and matches are only made when each has the same line number.

I know I can split fields and get the first word in field3 in Awk like this:

awk -F"," '{split($3,a," "); print a[1]}' file

But since I'm only moderately competent in Awk, I'm at a loss for how to approach a job that compares two files using splits.

I could do it in Python like this:

with open('file1', 'r') as f1, open('file2', 'r') as f2:
    l1 = f1.readlines()
    l2 = f2.readlines()
    for i in range(len(l1)):
        line_1 = l1[i].split(',')      # fields of file1, line i
        line_2 = l2[i].split(',')      # fields of file2, line i
        field_3_1 = line_1[2].split()  # words of file1 field 3
        field_3_2 = line_2[2].split()  # words of file2 field 3
        if field_3_2[0] in field_3_1:  # first word of file2 field 3?
            one = ' '.join(field_3_1)
            two = ' '.join(field_3_2)
            print(','.join((line_2[0], line_1[0], two, one)))

But I'd like to know how a job like this would be done in Awk, as I occasionally use shells where only Awk is available.

This may seem like a strange task, and my example can be a bit confusing, but I need it to check for broken/ill-formatted data in one of the files.

Upvotes: 0

Views: 177

Answers (3)

ooga

Reputation: 15501

awk -F, -vOFS=, '
{
  num1 = $1                       # save field 1 of the current file1 line
  name1 = $3                      # save field 3 of the current file1 line
  split(name1, words1, " ")       # words of file1 field 3
  getline <"file2"                # read the corresponding line of file2
  split($3, words2, " ")          # words of file2 field 3
  for (i in words1)
    if (words2[1] == words1[i]) { # first word of file2 matches a file1 word
      print $1, num1, $3, name1
      break
    }
}
' file1

Output:

1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER

Upvotes: 2

glenn jackman

Reputation: 247042

I like @ooga's answer better than this:

awk -F, -v OFS=, '
    NR==FNR {                     # first file (file2): remember per-line data
        split($NF, a, " ")
        data[NR,"word"] = a[1]    # first word of field 3
        data[NR,"id"] = $1
        data[NR,"value"] = $NF
        next
    }
    {                             # second file (file1): FNR pairs the lines
        n = split($NF, a, " ")
        for (i=1; i<=n; i++)
            if (a[i] == data[FNR,"word"])
                print data[FNR,"id"], $1, data[FNR,"value"], $NF
    }
' file2 file1

Upvotes: 1
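For reference, this answer can be sanity-checked end to end against the sample data from the question; in this sketch the printf lines just recreate file1 and file2, and result.csv is an arbitrary name for the demo:

```shell
# Recreate the two sample files from the question.
printf '%s\n' '123,info,ONE NAME' '124,info,ONE VARIATION' \
  '125,info,NAME ANOTHER' '126,info,SOME TITLE' > file1
printf '%s\n' '1,info,NAME FIRST' '2,info,TWO VARIATION' \
  '3,info,NAME SECOND' '4,info,ANOTHER TITLE' > file2

# Two-pass approach: load file2 first, then walk file1 line by line.
awk -F, -v OFS=, '
    NR==FNR {
        split($NF, a, " ")
        data[NR,"word"] = a[1]
        data[NR,"id"] = $1
        data[NR,"value"] = $NF
        next
    }
    {
        n = split($NF, a, " ")
        for (i=1; i<=n; i++)
            if (a[i] == data[FNR,"word"])
                print data[FNR,"id"], $1, data[FNR,"value"], $NF
    }
' file2 file1 > result.csv

cat result.csv
# 1,123,NAME FIRST,ONE NAME
# 3,125,NAME SECOND,NAME ANOTHER
```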

plotinus

Reputation: 31

You can try something along these lines, although the following keeps only one file1 line per word (later lines overwrite earlier ones in the hash) and does not restrict matches to the same line number:

awk -F, 'FNR==NR {
    count = split($3, words, " ")
    for (i = 1; i <= count; i++) {
        field1hash[words[i]] = $1
        field3hash[$1] = $3
    }
    next
}
{
    split($3, words, " ")
    if (field1hash[words[1]]) {
        ff1 = field1hash[words[1]]
        print $1 "," ff1 "," $3 "," field3hash[ff1]
    }
}' file1 file2

Upvotes: 1
