spamer

Reputation: 1

Delete text between two strings with sed or awk

I searched for a solution but could not find one that works.

I want to remove all characters between two strings on every header line.

The input is a FASTA file:

>CAM_P0000101_READ_00457523 /accession=CAM_P0000101_READ_00457523 /xy=2625_3790 /region=2 /run=R_2008_08_11_16_51_31_ /length=253 /sample_id=1309720343513924875 /sample_acc=CAM_P0000101_SMPL_PAPUT2 /sample_name=CAM_P0000101_SMPL_PAPUT2 /site_id_n=CAM_P0000101_SITE_PAPUT2
GTGCCTTCGGGAACCGGGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTGCCAGCACGTAATGGTGGGAACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGCTAGGACAGACGGCTGCAAACCNGCGAGTGGGG
>CAM_P0000101_READ_00460168 /accession=CAM_P0000101_READ_00460168 /xy=2199_0493 /region=2 /run=R_2008_08_11_16_51_31_ /length=233 /sample_id=1309720343513924875 /sample_acc=CAM_P0000101_SMPL_PAPUT2 /sample_name=CAM_P0000101_SMPL_PAPUT2 /site_id_n=CAM_P0000101_SITE_PAPUT2
TTTACCGCGGCTGCTGGCACGAAGTTAGCCGGACCTTATTCTTCGGGTACAGTCATTATCTTTCCCGACAAAAGAGCTTTACAACCCAAGGGCCTTCTTCACTCACGCGGCATCGCTGCATCAGGCTTTCGCCCATTGTGCAAGATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGGCCGTATCTCAGTCCCAGTGTGGCTGATCATCCTCTACAAATCAGCTATTGATTACT

On each header line, I want to delete all text between the leading >CAM_P* identifier and the /sample_name=* field, and all text after the /sample_name=* field.

Only these two things should remain: >CAM_* /sample_name=*

All of this should be removed:

/accession=CAM_P0000101_READ_00457523 /xy=2625_3790 /region=2 /run=R_2008_08_11_16_51_31_ /length=253 /sample_id=1309720343513924875 /sample_acc=CAM_P0000101_SMPL_PAPUT /site_id_n=CAM_P0000101_SITE_PAPUT2

Could anyone please help me?

Upvotes: 0

Views: 1657

Answers (2)

karakfa

Reputation: 67467

awk to the rescue

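awk '{line=""; sep=""; p=q=0;          # reset the output line and both flags for each record
      for(i=1;i<=NF;i++) {
          if(!p && $i~/CAM_P/) {          # first field containing CAM_P
              p=1;
              line=line sep $i;
              sep=FS
          } else if(!q && $i~/sample_name/) {   # first field containing sample_name
              q=1;
              line=line sep $i;
              sep=FS
          }
       }
       print line                         # empty for lines with neither field
      }' filename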
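
Note that the loop above prints an empty line for any input line that contains neither field, which includes the sequence lines. If the sequence data should stay in the output, a variant along these lines might work. This is only a sketch: the function name keep_id_and_sample is made up for illustration, and it assumes the ID is always the first field of a line starting with ">".

```shell
# Sketch: on header lines keep only the ID and the /sample_name= field;
# pass every other line (the sequence data) through unchanged.
keep_id_and_sample() {
    awk '
        /^>/ {
            out = $1                          # first field is the ">CAM_P..." ID
            for (i = 2; i <= NF; i++)
                if ($i ~ /^\/sample_name=/) { # first /sample_name= field
                    out = out FS $i
                    break
                }
            print out
            next
        }
        { print }                             # sequence lines
    ' "$@"
}
```

It can be called as keep_id_and_sample filename, or used at the end of a pipeline.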

Another alternative, with grep:

grep -o ">CAM_P\w*\|/sample_name=\w*" filename | awk 'ORS=NR%2?FS:RS'

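This matches only the two wanted words, which grep -o prints on separate lines, and the awk step then merges each pair of output lines back into one. Only the header information is printed; the sequence lines are dropped. The \| and \w in the pattern are not part of POSIX basic regular expressions, so this assumes GNU grep or a compatible implementation.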
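
The merging step can be tried on its own. In the sketch below, the function name join_pairs is made up for illustration. The awk program sets ORS (the output record separator) to a space after odd-numbered lines and to a newline after even-numbered ones, so every two input lines end up on one output line.

```shell
# Sketch: join every two consecutive input lines into one line.
# ORS becomes FS (a space) on odd lines and RS (a newline) on even lines;
# the assignment is also the pattern, and it is always true, so each line is printed.
join_pairs() {
    awk 'ORS=NR%2?FS:RS' "$@"
}

printf '%s\n' '>CAM_P1' '/sample_name=S2' '>CAM_P3' '/sample_name=S4' | join_pairs
```

The pairing is purely positional, so it presumably goes out of step if a header has no /sample_name= field.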

Upvotes: 2

Beta

Reputation: 99094

How about this:

sed 's/\(>CAM_P[^ ]*\).*\(\/sample_name=[^ ]*\).*/\1 \2/' filename
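
To see what the two capture groups keep, here is a small sketch that wraps the same substitution in a function and runs it on a shortened header. The function name trim_headers and the shortened sample lines are made up for illustration.

```shell
# Sketch: the same sed substitution, wrapped in a function.
# \1 captures ">CAM_P..." up to the first space,
# \2 captures "/sample_name=..." up to the next space,
# and everything else on the line is discarded.
# Lines that do not match (the sequence lines) pass through unchanged.
trim_headers() {
    sed 's/\(>CAM_P[^ ]*\).*\(\/sample_name=[^ ]*\).*/\1 \2/' "$@"
}

printf '%s\n' \
    '>CAM_P0000101_READ_00457523 /accession=CAM_P0000101_READ_00457523 /length=253 /sample_name=CAM_P0000101_SMPL_PAPUT2 /site_id_n=CAM_P0000101_SITE_PAPUT2' \
    'GTGCCTTCGGGAACC' | trim_headers
```

With this input the header should be reduced to the ID and the /sample_name= field, while the sequence line is left as it is.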

Upvotes: 1
