Rish
Rish

Reputation: 1

merging 2 csv files using awk

I have 3 CSV files:

Base File(values initialised with 0)

steve tignor ash michael jose sam joshua
0       0     0     0     0     0    0

File 1:

tignor michael  jose
888      9       -2

File 2:

ash joshua
77   66

Output I need:

      steve tignor ash michael jose sam joshua
File1   0    888    0    9      -2   0     0
File2   0     0     77   0       0   0     66

I tried with sorting the files first with awk and then merge with paste but as I have 1000+ columns and having 30 files it just did not work.

Code:

awk -F"," 'NR==1{
  split($0,a,FS);asort(a);
  for(i=1;i<=NF;i++)b[$i]=i
} {
  for(i=1;i<=NF;i++)printf("%s,",$(b[a[i]]));
  print x
}' File1 > 1.csv

awk -F"," 'NR==1{
  split($0,a,FS);asort(a);
  for(i=1;i<=NF;i++)b[$i]=i
} {
  for(i=1;i<=NF;i++)printf("%s,",$(b[a[i]]));
  print x
}' File2 > 2.csv

paste -d"\n" 1.csv 2.csv > merge.csv

Need some assistance here. Thanks in advance.

Upvotes: 1

Views: 384

Answers (1)

lucaslugao
lucaslugao

Reputation: 173

I assumed that you omitted the commas in the files. If you're using space separated files you could just change the separator used in the split function.

awk '
ARGIND==1 && FNR==1{
  split($0, base, ",")
  printf("file,%s\n",$0)
}
ARGIND > 1 && FNR==1{
  split($0, names, ",")
  printf("%s", ARGV[ARGIND])
}
ARGIND > 1 && FNR==2{
  split($0, values, ",")
  for(i in names)
    line[names[i]] = values[i]
  for(i in base){
    if(base[i] in line)
      printf(",%s", line[base[i]])
    else
      printf(",0")
  }
  delete line
  print ""
}
' base.csv file1.csv file2.csv

Example:

file1.csv:

tignor,michael,jose
888,9,-2

file2.csv:

ash,joshua
77,66

and base.csv:

steve,tignor,ash,michael,jose,sam,joshua
0,0,0,0,0,0,0

the output is:

file,steve,tignor,ash,michael,jose,sam,joshua
file1.csv,0,888,0,9,-2,0,0
file2.csv,0,0,77,0,0,0,66

Basically, the script is running in 2 steps:

  • First we read the names from the base.csv and store them into an array.
  • Then, for each file we store the names appearing in its header and try to print one value for each column in the base csv. if we don't have the value corresponding to a column in a particular file we just print 0 instead.

P.S. I made a new POSIX awk compatible version of the script:

awk --posix '
NR==FNR && FNR==1{
  split($0, base, ",")
  printf("file,%s\n",$0)
}
NR>FNR && FNR==1{
  split($0, names, ",")
  printf("%s", FILENAME)
}
NR>FNR && FNR==2{
  split($0, values, ",")
  for(i in names)
    line[names[i]] = values[i]
  for(i in base){
    if(base[i] in line)
      printf(",%s", line[base[i]])
    else
      printf(",0")
  }
  delete line
  print ""
}
' base.csv file1.csv file2.csv

Upvotes: 1

Related Questions