madieke
madieke

Reputation: 155

inefficient code: comparing combining different columns from different files awk or perl?

I have two files and I would like to match column 2 from file1 with column NF from file2. If they match I would like to output the whole line from file2 with, with in addition column 5 from file1 and column 5 from file 1 multiplied with column NF-2 from file 2at the end. The files have different lenghts.

I have the following two file-types:

file1

xx  name1 1 we    freq1
xy  name2 2 wer   freq2
xz  name3 3 werf  freq3

file2

..... value1 cv name1
..... value4 cvb name4
..... value3 cvb name3  
..... value1 vbn name5

desired output

..... value1 cv name1 freq1 (freq1*value1)
..... value3 cvb name3 freq3 (freq3*value3)

I have tried doing this using awk.

awk 'FNR==NR { two[$0]++; next } { for (i in two) { split(i, one); if (one[2] == $NF) print $0, one[5], $(NF-2)*one[5] } }' file1 file2 > file3

It works, but is super inefficient, and makes my comp crash after a while. The files are about 100,000 lines each. Maybe I should use perl or python for this? Or is there a way to read file 2 as well? Thanks!

Upvotes: 1

Views: 586

Answers (3)

mpapec
mpapec

Reputation: 50637

perl -ane'
  BEGIN{ open($f,shift)||die$!; %h = (%h, (split)[1,4]) while <$f> }
  $fq = $h{$F[-1]} or next;
  s|$| sprintf(" %s (%s)", $fq, "$fq*$F[-3]") |e;
  print;
' file1 file2 > file3

Expressed as a standalone script:

#!/usr/bin/perl
use strict; use warnings;

my %frequencies;

my $frequency_file = shift @ARGV;
open my $fh, "<", $frequence_file or die "Can't open $frequency_file: $!";

while (<$fh>) {
  my (undef, $name, undef, undef, $freq) = split;
  $frequencies{$name} = $freq;
}

while (<>) {
  my @fields = split;
  my ($value, $name) = @fields[-3, -1];
  next if not exists $frequencies{$name};
  my $freq = $frequencies{$name};
  print join(" ", @fields, $freq, "($freq * $value)"), "\n";
}

output

..... value1 cv name1 freq1 (freq1*value1)
..... value3 cvb name3   freq3 (freq3*value3)

Remove double qoutes for "$fq*$F[-3]" to do actual multiplication.

Upvotes: 2

Ed Morton
Ed Morton

Reputation: 203368

All you need is:

awk 'NR==FNR{a[$2]=$NF;next} $NF in a{print $0, a[$NF], a[$NF] * $(NF-2)}' file1 file2

e.g.:

$ cat file1
xx  name1 1 we    3
xy  name2 2 wer   5
xz  name3 3 werf  7

$ cat file2
..... 2 cv name1
..... 4 cvb name4
..... 6 cvb name3
..... 8 vbn name5

$ awk 'NR==FNR{a[$2]=$NF;next} $NF in a{print $0, a[$NF], a[$NF] * $(NF-2)}' file1 file2
..... 2 cv name1 3 6
..... 6 cvb name3 7 42

Upvotes: 3

Borodin
Borodin

Reputation: 126722

It sounds like the best way is to read all of the relevant columns (2 and 5) from file1 into a hash, and then just process file2 line by line.

If the hash uses file1 column 2 as a key, then it can simply be indexed with the value of the last column of file2 to see if there was a corresponding value in the other file.

Then all that is necessary is to print the values from the file2 record, plus the calculated values using the data from the hash element.

This program demonstrates. The variable names are a little obscure as you don't give a meaning to the fields in your question, so all I can do is refer to file and column numbers.

I have had to output string*string instead of the actual product, because your data doesn't have numerical values.

use strict;
use warnings;
use autodie;

open my $fh1, '<', 'file1';
my %file1;
while (<$fh1>) {
  my ($f1_2, $f1_5) = (split)[1,4];
  $file1{$f1_2} = $f1_5;
}

open my $fh2, '<', 'file2';
while (<$fh2>) {
  my @f2_rec = split;
  my ($f2_nm2, $f2_n) = @f2_rec[-3,-1];
  if (my $f1_5 = $file1{$f2_n}) {
    print join(' ', @f2_rec, $f1_5, "$f1_5*$f2_nm2"), "\n";
  }
}

output

..... value1 cv name1 freq1 freq1*value1
..... value3 cvb name3 freq3 freq3*value3

Upvotes: 3

Related Questions