VIPIN KUMAR
VIPIN KUMAR

Reputation: 3137

How to join 3 files using awk?

Below are three files -

emp.txt
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10
7698|KING|MANAGER|7839|01-MAY-81|2850||10
7782|CLARK|MANAGER|7839|09-JUN-81|2450||10
7566|JONES|MANAGER|7839|02-APR-81|2975||40
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50
7902|FORD|ANALYST|7566|03-DEC-81|3000||20


dept.txt
10|ACCOUNTING|NEW YORK
20|RESEARCH|DALLAS
30|SALES|CHICAGO
40|OPERATIONS|BOSTON

citycode.txt
1123|NEW YORK
1124|DALLAS
1125|CHICAGO
1126|BOSTON
1127|WASHINGTON

expected output : $0(from emp.txt where $6 >= 2850)+$2(from dept.txt)+$1(from citycode.txt)

7839|BLAKE|PRESIDENT||17-NOV-81|5000||10|ACCOUNTING|1123
7698|KING|MANAGER|7839|01-MAY-81|2850||10|ACCOUNTING|1123
7566|JONES|MANAGER|7839|02-APR-81|2975||40|OPERATIONS|1126
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50|NULL|NULL
7902|FORD|ANALYST|7566|03-DEC-81|3000||20|RESEARCH|1124

Explanation - join these files using common key of emp.txt($NF : outer join on emp.txt--record no 50 doesn't have common that is why it will print NULL) and dept.txt($1) where emp($6 >= 2850) and use common column of dept.txt($NF) and citycode.txt($1). joining two files is an easy task using common key but how can we join three files together.

Upvotes: 4

Views: 571

Answers (3)

Akshay Hegde
Akshay Hegde

Reputation: 16997

awk 'BEGIN{ FS=OFS="|" }
     FNR==1{ f++ }
     f==1{ d[$1]=$2; c[$NF]=$1; next }
     f==2 && $NF in c{ c[c[$NF]]=$1; next }
     $6 >=2850{
       print $0, ($NF in d ? d[$NF] : "NULL" ),( $NF in c ? c[$NF] : "NULL")
   }' dept.txt citycode.txt emp.txt

Explanation

awk '                               # Call awk
      BEGIN{          
              FS=OFS="|"            # Set input and output separator
      }
      FNR==1{                       # FNR will be 1 for each file when awk reads first line
            f++                     # File counter
      }

      # use can also do FILENAME == "dept.txt" here but f==1 is easy

      f==1{                         # for first file dept.txt
            d[$1]=$2;               # set array d where index is $1 and value is $2 of file dept.txt 
            c[$NF]=$1;              # set array c where index is last field and value is first field of file dept.txt
            next                    # stop processing go to next line
      }

      # when f equal 2 that is when we read second file citycode.txt 
      # and last field of current file exists in array c
      # there is no point in reading extra line exists in citycode.txt
      # we just need whichever city exists in dept.txt
      # so $NF in c

      f==2 && $NF in c{             

           # modify array c by empid 
           # that is index will be the value of array c 
           # corresponding to last field of current file and 
           # array c value will be citycode
           # Its because to make it easy to access array c value by empid while
           # reading emp.txt file

            c[c[$NF]]=$1;           

           # gawk user can delete array element here like below
           # if you have gawk uncomment below line
           # delete c[$NF]

            next                   # stop processing go to next line
      }

      $6 >=2850{                   # here we read emp.txt if 6th field is greater than or equal to 2850

            # Print current record/row/line from emp.txt

            # if last field of current file that is 
            # empid exists in array d then print department else NULL

            # if last field of current file that is
            # empid exists in array c then print citycode else NULL

            print $0,($NF in d?d[$NF]:"NULL"),($NF in c?c[$NF]:"NULL")
      }
   ' dept.txt citycode.txt emp.txt

Input

$ cat emp.txt 
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10
7698|KING|MANAGER|7839|01-MAY-81|2850||10
7782|CLARK|MANAGER|7839|09-JUN-81|2450||10
7566|JONES|MANAGER|7839|02-APR-81|2975||40
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50
7902|FORD|ANALYST|7566|03-DEC-81|3000||20

$ cat dept.txt 
10|ACCOUNTING|NEW YORK
20|RESEARCH|DALLAS
30|SALES|CHICAGO
40|OPERATIONS|BOSTON

$ cat citycode.txt 
1123|NEW YORK
1124|DALLAS
1125|CHICAGO
1126|BOSTON
1127|WASHINGTON

Output

$ awk 'BEGIN{FS=OFS="|"}FNR==1{f++}f==1{d[$1]=$2;c[$NF]=$1;next}f==2 && $NF in c{c[c[$NF]]=$1;next}$6 >=2850{print $0,($NF in d?d[$NF]:"NULL"),($NF in c?c[$NF]:"NULL")}' dept.txt citycode.txt emp.txt
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10|ACCOUNTING|1123
7698|KING|MANAGER|7839|01-MAY-81|2850||10|ACCOUNTING|1123
7566|JONES|MANAGER|7839|02-APR-81|2975||40|OPERATIONS|1126
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50|NULL|NULL
7902|FORD|ANALYST|7566|03-DEC-81|3000||20|RESEARCH|1124

Upvotes: 3

jas
jas

Reputation: 10865

Use the built-in FILENAME variable to build the appropriate associative arrays when the city and department codes are being processed.

Then when processing the employees, output the desired data:

BEGIN {FS=OFS="|"}
function chknull(s) { return s ? s : "NULL"}
FILENAME == "citycode.txt" {citycode[$2]=$1}
FILENAME == "dept.txt" {depname[$1]=$2; depcity[$1]=$3}
FILENAME == "emp.txt" && $6 >= 2850 {
    print $0, chknull(depname[$NF]), chknull(citycode[depcity[$NF]])
}


$ awk -f j3.awk citycode.txt dept.txt emp.txt 
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10|ACCOUNTING|1123
7698|KING|MANAGER|7839|01-MAY-81|2850||10|ACCOUNTING|1123
7566|JONES|MANAGER|7839|02-APR-81|2975||40|OPERATIONS|1126
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50|NULL|NULL
7902|FORD|ANALYST|7566|03-DEC-81|3000||20|RESEARCH|1124

(Note that I also added the chknull function to print "NULL" instead of empty string to match the desired output from the question, but that's not important for the meat of the question which is how to do joins with more than two files.)

Upvotes: 4

Olivier Dulac
Olivier Dulac

Reputation: 3791

it is straightforward in awk:

  • read the 2 files you need to add in the end of the line (dept.txt and citycode.txt)
  • and add them while parsing the main (emp.txt) file

the code is:

awk -F'|'  -v dptfile="dept.txt" -v citycodefile="citycode.txt" -v from="2850" '
   BEGIN { OFS=FS; 
           rem="build 2 arrays, dpt[] associates number with department,";
           rem="and dptcity[] associate same number with department city";
           while ((getline line<dptfile) > 0) {
              split(line,a,OFS);dpt[a[1]]=a[2]; dptcity[a[1]]=a[3]
           }
           close(dptfile)
           rem="build 3rd array, city[cityname] associates a city name to its number";
          while ((getline line<citycodefile)>0) {
              split(line,a,OFS); city[a[2]]=a[1] ; 
           }
           close(citycodefile); 
         }

  ( $6>=from ) { print $0 OFS ($8 in dpt? dpt[$8]:"NULL") OFS (dptcity[$8] in city? city[dptcity[$8]]:"NULL") ;}
  ' emp.txt

Given the input you cited (and borrowing @akshay-hegde good presentation):

Input

$ cat emp.txt
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10
7698|KING|MANAGER|7839|01-MAY-81|2850||10
7782|CLARK|MANAGER|7839|09-JUN-81|2450||10
7566|JONES|MANAGER|7839|02-APR-81|2975||40
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50
7902|FORD|ANALYST|7566|03-DEC-81|3000||20

$ cat dept.txt 
10|ACCOUNTING|NEW YORK
20|RESEARCH|DALLAS
30|SALES|CHICAGO
40|OPERATIONS|BOSTON

$ cat citycode.txt
1123|NEW YORK
1124|DALLAS
1125|CHICAGO
1126|BOSTON
1127|WASHINGTON

Output

$ awk -F'|' -v dptfile="dept.txt" -v citycodefile="citycode.txt" -v from="2850" 'BEGIN { OFS=FS; rem="build 2 arrays, dpt[] associates number with department,";rem="and dptcity[] associate same number with department city"; while ((getline line<dptfile) > 0) { split(line,a,OFS);dpt[a[1]]=a[2]; dptcity[a[1]]=a[3];} ; close(dptfile) ; rem="build 3rd array, city[cityname] associates a city name to its number"; while ((getline line<citycodefile)>0) { split(line,a,OFS); city[a[2]]=a[1] ; }; close(citycodefile); } ( $6>=from ) { print $0 OFS ($8 in dpt? dpt[$8]:"NULL") OFS (dptcity[$8] in city? city[dptcity[$8]]:"NULL") ;}' emp.txt
7839|BLAKE|PRESIDENT||17-NOV-81|5000||10|ACCOUNTING|1123
7698|KING|MANAGER|7839|01-MAY-81|2850||10|ACCOUNTING|1123
7566|JONES|MANAGER|7839|02-APR-81|2975||40|OPERATIONS|1126
7788|SCOTT|ANALYST|7566|19-APR-87|3000||50|NULL|NULL
7902|FORD|ANALYST|7566|03-DEC-81|3000||20|RESEARCH|1124

Upvotes: 1

Related Questions