Reputation: 73
I am not sure if I can do this purely with sed:
I am trying to rearrange lines like this
GF:001,GF:00012,GF:01223<TAB>XXR
GF:001,GF:00012,GF:01223,GF:0666<TAB>XXXR3
to
GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3
Does anyone have any hints? The number of GF:XXXX entries varies from line to line, as does the length of each entry.
I am stuck with sed -n '
'/\(XX.*\)$/' {
s/,/\t\1\n/
}' input
but I cannot refer back to the pattern that was matched in the address. Any ideas? Cheers!
Update: I do not think this is possible with sed alone, so I used Perl instead:
perl -e 'open(IN, "< file") or die "cannot open file: $!";
while (<IN>) {
@a = split(/\t/);   # $a[0]: comma-separated IDs; $a[1]: label, still ending in a newline
@gos = split(/,/, $a[0]);
foreach (@gos) {
print $_."\t".$a[1];
}
}
close( IN );' > output
But if anyone knows a way to solve this with sed alone, please post it here.
Upvotes: 4
Views: 2310
Reputation: 203532
awk -F'[,\t]' '{for (i=1;i<NF;i++) print $i"\t"$NF}' file
Awk reads one line at a time (by default) and splits the line up into fields. I'm using -F to tell awk to separate the line into fields at each comma or tab. NF is the number of fields in the line, $i is the contents of field number i.
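To make the field splitting concrete, here is a minimal sketch that prints NF and every field for a sample line. The helper name show_fields is only for illustration, and the input is generated with printf so that the separator is a real tab character.

```shell
# show_fields: hypothetical helper that prints the field count and each field,
# using the same comma-or-tab separator as the one-liner above.
show_fields() {
    awk -F'[,\t]' '{ print "NF=" NF; for (i = 1; i <= NF; i++) print i ": " $i }'
}

# Sample line from the question, with a real tab before the last field.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | show_fields
```

The last field ($NF) is the label, which is why the loop in the one-liner stops at i<NF and pairs every earlier field with $NF.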
Upvotes: 2
Reputation: 31548
Well it took me 3 hours to do it
sed -re ':a; s/(GF:[0-9]*[^,]*),([^<]*)(<TAB>[A-Z0-9]*)/\1\3\n\2\3/g;ta; ' file.txt
Upvotes: 2
Reputation: 753815
It can be done in sed, though I would probably use Perl (or Awk or Python) to do it.
I claim no elegance for this solution, but brute force and ignorance sometimes pay off. I created a file called, unoriginally, sed.script containing:
/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/\1<TAB>\3@@@@@\2<TAB>\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}
I ran it as:
sed -f sed.script input
where input contained the two lines shown in the question. It produced the output:
GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3
(I took the liberty of deliberately misinterpreting <TAB> to be a 5-character sequence instead of a single tab character; you can easily fix the answer to handle an actual tab character instead.)
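Explanation of the sed script:
Select lines that contain more than one GF:nnn entry separated by commas (we do not need to process lines that contain a single such occurrence). Do the rest of the script only on such lines. Anything else is passed through (printed) unchanged.
Define the redo label so that the script can branch back to it.
Split the line into three remembered parts: the first GF:nnn entry, the rest of the comma-separated list, and the field after <TAB>. Replace this with the first field, <TAB>, third field, implausible marker pattern (@@@@@), second field, <TAB>, third field.
Copy the modified line to the hold space (h), delete the marker and everything after it, and print what is left (p).
Swap the saved copy back into the pattern space (x) and remove everything up to and including the marker.
If any substitution succeeded, branch back to the redo label. Otherwise delete the pattern space (d), since it has already been printed.
This is a simple loop that reduces the number of GF:nnn entries by one on each iteration.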
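Following up on the note above about real tab characters, here is one possible adaptation of the same script, offered as a sketch rather than the definitive fix. It assumes the input uses a single real tab before the label; the variable tab and the function name split_ids are illustrative only. The tab is produced with printf and spliced into the script, which avoids relying on \t support in the sed implementation.

```shell
# Assumption: the separator is one real tab character, not the literal text <TAB>.
tab=$(printf '\t')

# split_ids: hypothetical wrapper around the script above with the tab substituted in.
split_ids() {
    sed -e "/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/\1${tab}\3@@@@@\2${tab}\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}"
}

# Lines with a single GF entry do not match the address and pass through unchanged.
printf 'GF:001,GF:00012,GF:01223\tXXR\nGF:0666\tXXXR3\n' | split_ids
```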
Upvotes: 7
Reputation: 27050
You can do it straightforwardly with awk:
$ awk '{gsub(/,/, "\t" $NF "\n");print}' input
In this case, we just replace each comma by a tab concatenated with the last field (NF stores the number of fields of a record; $NF gets the NFth field) concatenated with a newline. Then, print the result.
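As a quick check of the idea, the sketch below runs the same gsub call on a sample line built with printf. The function name expand_with_gsub is made up for this example. It assumes the last field contains no & or backslash, since gsub treats those specially in the replacement string.

```shell
# expand_with_gsub: hypothetical wrapper around the gsub one-liner above.
# $NF is evaluated once, before the substitution, so every comma is replaced
# by the same "tab + last field + newline" string.
expand_with_gsub() {
    awk '{ gsub(/,/, "\t" $NF "\n"); print }'
}

printf 'GF:001,GF:00012,GF:01223\tXXR\n' | expand_with_gsub
```

The final GF entry needs no replacement because it is already followed by the tab and the label in the original line.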
It can be solved with sed, too, in a way similar to, but IMHO a bit simpler than, Jonathan's solution (which is pretty sophisticated, I should remark).
sed -n '
:BEGIN
h
s/,.*<TAB>/<TAB>/
p
x
s/^[^,]*,//
t BEGIN' input
Here, we define a label in the beginning of the script:
:BEGIN
Then we copy the content of the pattern space to the hold space:
h
Now, we replace everything from the first comma until the tab with only a tab:
s/,.*<TAB>/<TAB>/
We print the result...
p
...and retrieve the content of the hold space:
x
Since we printed the first line - which contains the first GF:XXX pattern followed by the final XXR pattern - we remove the first GF:XXX pattern from the line:
s/^[^,]*,//
If a replacement is executed, we branch to the beginning of script:
t BEGIN
And everything is applied again to the same line, except that now this line does not have the first GF:XXX pattern anymore. OTOH, if no replacement is made, then the processing of the current line is done and we do not jump to the beginning anymore.
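The walkthrough above writes the separator as the literal text <TAB>. For input with a real tab character, the same loop could look roughly like the sketch below. The variable tab and the function name expand_ids are assumptions made for this example, and the tab is generated with printf so the script does not depend on \t support in sed.

```shell
# Assumption: each line has exactly one real tab, placed before the final label.
tab=$(printf '\t')

# expand_ids: hypothetical wrapper around the BEGIN loop described above.
expand_ids() {
    sed -n "
:BEGIN
h
s/,.*${tab}/${tab}/
p
x
s/^[^,]*,//
t BEGIN"
}

printf 'GF:001,GF:00012,GF:01223\tXXR\n' | expand_ids
```

On the last iteration neither substitution matches, so t does not branch, and because of -n nothing further is printed for that line.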
Upvotes: 3
Reputation: 3302
If you don't strictly want sed, awk is good at doing this:
awk -F'\t|,' '{ i=1; do { printf("%s\t%s\n",$i,$NF); i++;} while ( i<NF ); }' inputfile
Upvotes: 2