Reputation: 4098
I have a text file with the following structure:
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
Each ID corresponds to a dataset which is analysed by an operator several times, i.e. J1 and J2 are the first and second attempts by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value lies in the column true.
What I would like to do is to create 3 new text files comparing the results for J1 vs J2, S1 vs S2 and J1 vs S1. Example output for J1 vs J2:
ID,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
where a1 is measurement a for J1, etc. Another example is for S1 vs S2:
ID,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3
The IDs will not be in alphanumerical order, nor will the operators be clustered for the same ID. I'm not certain how best to approach this task - using linux tools or a scripting language like perl/python.
My initial attempt using linux quickly hit a brick wall.
First find all unique IDs (sorted):
awk -F, '/^WCBP/ {print $1}' file | sort -k 1.5n | uniq > unique_ids
Loop through these IDs and sort J1, J2:
foreach i (`more unique_ids`)
grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
This gives me the data sorted:
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop:
awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'
Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better to use a proper scripting language.
Upvotes: 3
Views: 282
Reputation: 597
Any problem that awk or sed can solve, there is no doubt that python, perl, java, go, c++ or c can solve too. However, it is not necessary to write a complete program in any of them; an awk one-liner will do.
VERSION 1
For most use cases, I think VERSION 1 is good enough.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, '$2==x{id=$1;a=$3;b=$4;c=$5;d=$6} id==$1 && $2==y{$3=a","$3; $4=b","$4; $5=c","$5; $6=d","$6; $2=""; $0=$0; $1=$1; print}' \
x=J1 y=S1
Just replace the values of the arguments x and y with what you like.
Please note that the values of x and y must follow alphabetical order, e.g., x=J1 y=S1 is OK, but x=S1 y=J1 doesn't work.
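A note on the trick at the end of the awk program: emptying $2 and re-reading $0 (FS=',+' treats consecutive commas as a single separator) drops the operator field, and $1=$1 then rebuilds the record with OFS. If I have traced it correctly, running VERSION 1 as given (x=J1 y=S1) on the sample input in the question prints:
WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1
WCBP12238,78.6,65.9,79.0,66.0,56.2,40.2,82.1,66.5,84.1
WCBP12239,70.9,86.6,71.3,87.8,66.0,23.0,73.7,23.0,82.1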
VERSION 2
The limitation mentioned in VERSION 1, that you have to specify x and y in alphabetical order, is removed; x=S1 y=J1 is OK now.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6} id==$1 && $2==z{$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6; $2=""; $0=$0; $1=$1; print}' \
x=S1 y=J1
However, the data of J1 is still put before the data of S1, which means the column a1 in the resulting output is always the column a of J1 in the input file, and a2 in the resulting output is always the column a of S1 in the input file.
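For example (again, if I have traced it correctly), VERSION 2 with x=S1 y=J1 still prints J1's values first:
WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3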
VERSION 3
The limitation mentioned in VERSION 2 is removed. Now with x=S1 y=J1, the output column a1 would be the input column a of S1, and a2 would be the a of J1.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6} id==$1 && $2==z{if (z==y) {$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6} else {$3=$3","a;$4=$4","b;$5=$5","c;$6=$6","d} $2=""; $0=$0; $1=$1; print}' \
x=S1 y=J1
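If I have traced it correctly, VERSION 3 with x=S1 y=J1 on the sample input prints:
WCBP12234,81.8,63.7,82.7,67.7,67.0,72.2,87.5,71.6,75.3
WCBP12236,77.2,75.7,81.5,80.6,69.4,65.9,84.1,83.2,82.1
WCBP12238,65.9,78.6,66.0,79.0,40.2,56.2,66.5,82.1,84.1
WCBP12239,86.6,70.9,87.8,71.3,23.0,66.0,23.0,73.7,82.1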
Upvotes: 0
Reputation: 6573
I didn't use Text::CSV like TLP did. You could if you needed to, but since there are no embedded commas in the fields in this example, I did a simple split on ','. Also, the true fields from both operators are listed (instead of just one), as I thought the special case of the last value complicates the solution.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/ mesh /;

my %data;
while (<DATA>) {
    chomp;
    my ($id, $op, @vals) = split /,/;
    $data{$id}{$op} = \@vals;    # values keyed by ID, then operator
}

my @ops = ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]);

for my $id (sort keys %data) {
    for my $comb (@ops) {
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        # mesh interleaves the two value lists: a1, a2, b1, b2, ...
        print $fh join(",", $id, mesh(@$a1, @$a2)), "\n";
        close $fh or die $!;
    }
}
__DATA__
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12239,J1,78.6,79.0,56.2,82.1,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12238,J1,70.9,71.3,66.0,73.7,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
The output files produced are below:
J1 J2.txt
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3,75.3
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1,82.1
WCBP12238,70.9,75.1,71.3,75.2,66.0,54.3,73.7,76.4,84.1,84.1
WCBP12239,78.6,66.6,79.0,72.9,56.2,79.5,82.1,76.6,82.1,82.1
S1 S2.txt
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3,75.3
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1,82.1
WCBP12238,65.9,72.7,66.0,73.2,40.2,52.6,66.5,73.9,84.1,84.1
WCBP12239,86.6,86.0,87.8,86.9,23.0,62.3,23.0,89.7,82.1,82.1
J1 S1.txt
WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3,75.3
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1,82.1
WCBP12238,70.9,65.9,71.3,66.0,66.0,40.2,73.7,66.5,84.1,84.1
WCBP12239,78.6,86.6,79.0,87.8,56.2,23.0,82.1,23.0,82.1,82.1
Update: To get only 1 true value, the for loop could be written like this:
for my $id (sort keys %data) {
    for my $comb (@ops) {
        local $" = '';    # no space in the interpolated file name, e.g. J1J2.txt
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        pop @$a2;    # drop the second copy of the true value
        my @mesh = grep defined, mesh(@$a1, @$a2);
        print $fh join(",", $id, @mesh), "\n";
        close $fh or die $!;
    }
}
Update: Added 'defined' as the test in the grep expression, as that is the proper way (instead of just testing $_, which could be 0 and be wrongly excluded from the list by grep).
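To illustrate with made-up values: mesh pads the shorter list with undef, which the grep then strips:
my @x = (1, 2);
my @y = (9);
my @m = grep defined, mesh(@x, @y);    # (1, 9, 2)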
Upvotes: 1
Reputation: 41
Python Way:
import itertools
info=["WCBP12236,J1,75.7,80.6,65.9,83.2,82.1",
"WCBP12236,J2,76.3,79.6,61.7,81.9,82.1",
"WCBP12236,S1,77.2,81.5,69.4,84.1,82.1",
"WCBP12236,S2,68.0,68.0,53.2,68.5,82.1",
"WCBP12234,J1,63.7,67.7,72.2,71.6,75.3",
"WCBP12234,J2,68.6,68.4,41.4,68.9,80.4",
"WCBP12234,S1,81.8,82.7,67.0,87.5,75.3",
"WCBP12234,S2,66.6,67.9,53.0,70.7,72.7",
"WCBP12238,J1,78.6,79.0,56.2,82.1,82.1",
"WCBP12239,J2,66.6,72.9,79.5,76.6,75.3",
"WCBP12239,S1,86.6,87.8,23.0,23.0,82.1",
"WCBP12239,S2,86.0,86.9,62.3,89.7,82.1",
"WCBP12239,J1,70.9,71.3,66.0,73.7,75.3",
"WCBP12238,J2,75.1,75.2,54.3,76.4,82.1",
"WCBP12238,S1,65.9,66.0,40.2,66.5,80.4",
"WCBP12238,S2,72.7,73.2,52.6,73.9,72.7" ]
def extract_data(operator_1, operator_2):
    operator_index = 1
    id_index = 0
    data = {}
    wanted = (operator_1.strip().upper(), operator_2.strip().upper())
    for line in info:
        conv_list = line.split(",")
        if len(conv_list) <= operator_index:
            continue
        if conv_list[operator_index].strip().upper() in wanted:
            key = conv_list[id_index]
            vals = conv_list[operator_index + 1:]
            if key in data:
                # interleave this row's values with the ones stored earlier
                data[key] = list(itertools.chain.from_iterable(zip(vals, data[key])))
            else:
                data[key] = vals
    return data

ret = extract_data("j1", "s2")
print(ret)
Output:
{'WCBP12236': ['68.0', '75.7', '68.0', '80.6', '53.2', '65.9', '68.5', '83.2', '82.1', '82.1'], 'WCBP12234': ['66.6', '63.7', '67.9', '67.7', '53.0', '72.2', '70.7', '71.6', '72.7', '75.3'], 'WCBP12238': ['72.7', '78.6', '73.2', '79.0', '52.6', '56.2', '73.9', '82.1', '72.7', '82.1'], 'WCBP12239': ['70.9', '86.0', '71.3', '86.9', '66.0', '62.3', '73.7', '89.7', '75.3', '82.1']}
Upvotes: 1
Reputation: 67910
You can use the Perl csv module Text::CSV to extract the fields, and then store them in a hash, where ID is the main key, the second field is the secondary key and all the fields are stored as the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop.
use strict;
use warnings;
use Text::CSV;
my %data;
my $csv = Text::CSV->new({
    binary => 1,    # safety precaution
    eol    => $/,   # important when using $csv->print()
});

while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;    # first two fields
    $data{$id}{$J} = $row;   # store line
}
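From there, one possible continuation (my sketch, not part of the original answer; the pairings, file names and single-true handling are assumptions) pairs the operators and writes one file per pair with $csv->print:
for my $pair ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]) {
    my ($p, $q) = @$pair;
    open my $out, '>', "${p}_vs_${q}.csv" or die $!;
    for my $id (sort keys %data) {
        my $r1 = $data{$id}{$p} or next;    # skip IDs missing either operator
        my $r2 = $data{$id}{$q} or next;
        # interleave a..d (fields 2..5) from both rows, keep one true value
        my @row = ($id, (map { ($r1->[$_], $r2->[$_]) } 2 .. 5), $r1->[6]);
        $csv->print($out, \@row);
    }
    close $out or die $!;
}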
Upvotes: 4