hiob

Reputation: 73

sed: hold pattern and rearrange line

I am not sure if I can do this purely with sed:

I am trying to rearrange lines like this

GF:001,GF:00012,GF:01223<TAB>XXR
GF:001,GF:00012,GF:01223,GF:0666<TAB>XXXR3

to

GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3

Any hints? The number of GF:XXXX entries varies from line to line, as does the length of each entry.

I am stuck with sed -n ' '/\(XX.*\)$/' { s/,/\t\1\n/ }' input, but I cannot refer back to the originally matched pattern. Any ideas? Cheers!

Update: I think it is not possible to do this using sed alone, so I used Perl:

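perl -e 'open(IN, "< file") or die "cannot open file: $!";
while (<IN>) {
    # $a[0] holds the comma-separated GF list, $a[1] the trailing field (newline included)
    @a = split(/\t/);
    @gos = split(/,/, $a[0]);
    foreach (@gos) {
      print $_."\t".$a[1];
    }
}
close( IN );' > output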

But if anyone knows a way to solve this with sed alone, please post it here.

Upvotes: 4

Views: 2310

Answers (5)

Ed Morton

Reputation: 203532

awk -F'[,\t]' '{for (i=1;i<NF;i++) print $i"\t"$NF}' file

Awk reads one line at a time (by default) and splits the line up into fields. I'm using -F to tell awk to separate the line into fields at each comma or tab. NF is the number of fields in the line, $i is the contents of field number i.
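
To see the field splitting in isolation, here is a minimal sketch that prints NF and each numbered field for one sample line from the question. The function name show_fields is made up for illustration; only the -F'[,\t]' separator comes from the command above.

```shell
# Hypothetical helper (not part of the answer): show how awk splits a sample line
# when the field separator is "comma or tab".
show_fields() {
  printf 'GF:001,GF:00012,GF:01223\tXXR\n' |
    awk -F'[,\t]' '{
      print "NF=" NF                                 # number of fields on this line
      for (i = 1; i <= NF; i++) print i ": " $i      # each field with its index
    }'
}

show_fields
```

The last field ($NF) is the value after the tab, which is why the loop in the answer stops at i<NF and appends $NF to every earlier field.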

Upvotes: 2

Mirage

Reputation: 31548

Well it took me 3 hours to do it

sed -re ':a; s/(GF:[0-9]*[^,]*),([^<]*)(<TAB>[A-Z0-9]*)/\1\3\n\2\3/g;ta; ' file.txt

Upvotes: 2

Jonathan Leffler

Reputation: 753815

It can be done in sed, though I probably would use Perl (or Awk or Python) to do it.

I claim no elegance for this solution, but brute force and ignorance sometimes pays off. I created a file called, unoriginally, sed.script containing:

/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/\1<TAB>\3@@@@@\2<TAB>\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}

I ran it as:

sed -f sed.script input

where input contained the two lines shown in the question. It produced the output:

GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3

(I took the liberty of deliberately misinterpreting <TAB> to be a 5-character sequence instead of a single tab character; you can easily fix the answer to handle an actual tab character instead.)

Explanation of the sed script:

  • Find lines with more than one occurrence of GF:nnn separated by commas (we do not need to process lines that contain a single such occurrence). Do the rest of the script only on such lines. Anything else is passed through (printed) unchanged.
  • Create a label so we can branch back to it
  • Split the line into 3 remembered parts. The first part is the initial GF information; the second part is any other GF information; the third part is the field after the <TAB>. Replace this with the first field, <TAB>, third field, implausible marker pattern (@@@@@), second field, <TAB>, third field.
  • Copy the modified line to the hold space.
  • Delete the marker pattern to the end.
  • Print.
  • Swap the hold space into the pattern space.
  • Remove everything up to and including the marker pattern.
  • If we've done any work, go back to the redo label.
  • Delete what's left (it was printed already).
  • End of script block.

This is a simple loop that reduces the number of the patterns by one on each iteration.
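
For input that contains a real tab rather than the 5-character <TAB> placeholder, the same script might be adapted roughly as follows. This is a sketch, not a tested replacement for the answer: the function name split_gf_lines and the shell variable tab are made up here, and the tab is produced with printf because \t inside a sed regex is not portable.

```shell
# Assumption: the input uses a real tab character between the GF list and the last field.
# "tab" and "split_gf_lines" are hypothetical names introduced for this sketch.
tab=$(printf '\t')

split_gf_lines() {
  # Same commands as sed.script, with <TAB> replaced by the real tab in $tab.
  sed "
/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/\1${tab}\3@@@@@\2${tab}\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}
"
}

# Usage: feed one of the question's lines, with a real tab, through the function.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | split_gf_lines
```

The script is passed in double quotes so the shell can expand ${tab}; the backslash sequences such as \( and \1 are left alone by the shell and reach sed unchanged.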

Upvotes: 7

brandizzi

Reputation: 27050

You can do it straightforwardly with awk:

$ awk '{gsub(/,/, "\t" $NF "\n");print}' input 

In this case, we just replace the comma by a tab concatenated with the last field (NF stores the number of fields of a record; $NF gets the NFth field) concatenated with a newline. Then, print the result.

It can be solved with sed, too, in a way similar to Jonathan's solution (which is pretty sophisticated, I should remark) but IMHO a bit better.

sed -n '
:BEGIN
 h
 s/,.*<TAB>/<TAB>/
 p
 x
 s/^[^,]*,//
t BEGIN' input

Here, we define a label in the beginning of the script:

:BEGIN

Then we copy the content of the pattern space to the hold space:

h

Now, we replace everything from the first comma until the tab with only a tab:

 s/,.*<TAB>/<TAB>/

We print the result...

p

...and retrieve the content of the hold space:

x

Since we printed the first line - which contains the first GF:XXX pattern followed by the final XXR pattern - we remove the first GF:XXX pattern from the line:

 s/^[^,]*,//

If a replacement is executed, we branch to the beginning of script:

t BEGIN

And everything is applied again to the same line, except that now this line does not have the first GF:XXX pattern anymore. OTOH, if no replacement is made, then the processing of the current line is done and we do not jump to the beginning anymore.
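
Like Jonathan's answer, the script above treats <TAB> as literal text. A minimal sketch of the same loop for input with a real tab character could look like this; the names split_by_comma and tab are hypothetical, and the tab is generated with printf rather than relying on \t support in sed.

```shell
# Assumption: fields are separated by a real tab. "tab" and "split_by_comma"
# are illustrative names, not part of the original answer.
tab=$(printf '\t')

split_by_comma() {
  sed -n "
:BEGIN
h
s/,.*${tab}/${tab}/
p
x
s/^[^,]*,//
t BEGIN
"
}

# Usage: each GF entry ends up on its own line, paired with the last field.
printf 'GF:001,GF:00012\tXXR\n' | split_by_comma
```

On the final pass neither substitution matches, so t does not branch, and because of -n nothing is printed beyond what p already emitted.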

Upvotes: 3

ssapkota

Reputation: 3302

If you don't strictly want sed, awk is good at doing this:

awk -F'\t|,' '{ i=1; do { printf("%s\t%s\n",$i,$NF); i++;}  while ( i<NF ); }' inputfile

Upvotes: 2
