Reputation: 83

Trim FASTA headers with sed

I have a reference genome containing the following headers (lines starting with >) that I would like to be renamed to simply the digit/letter of the chromosomes. I would like a sed statement to do this systematic replacement, but I am new to sed. Elsewhere in the file are additional headers that should be unchanged, and the genetic sequences between the headers should remain unchanged.

>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
>ST078051.1 Ovis aries is a sheep chromosome 2, whole genome shotgun sequence
>ST078052.1 Ovis aries is a sheep chromosome 3, whole genome shotgun sequence
>ST078053.1 Ovis aries is a sheep chromosome 4, whole genome shotgun sequence
>ST078054.1 Ovis aries is a sheep chromosome 5, whole genome shotgun sequence
>ST078055.1 Ovis aries is a sheep chromosome 6, whole genome shotgun sequence
>ST078056.1 Ovis aries is a sheep chromosome 7, whole genome shotgun sequence
>ST078057.1 Ovis aries is a sheep chromosome 8, whole genome shotgun sequence
>ST078058.1 Ovis aries is a sheep chromosome 9, whole genome shotgun sequence
>ST078059.1 Ovis aries is a sheep chromosome 10, whole genome shotgun sequence
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
>ST078080.1 Ovis aries is a sheep chromosome Y, whole genome shotgun sequence

Output should be:

>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>X
>Y

I tried the following, but it's not right.

sed 's/^.*\(chromosome.*,\).*$/\1/' file

Thank you!

Upvotes: 0

Answers (5)

Wiktor Stribiżew

Reputation: 626758

You can use

sed -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' file > newfile

See the online demo.

Details

-E - enables POSIX ERE syntax
^>.*chromosome ([[:alnum:]]+),.*$ - matches the start of the line (^), then >, any text (.*), the word chromosome and a space, then captures one or more alphanumeric characters into Group 1, then matches a comma and the rest of the line
>\1 - replaces the match (here, the whole line) with > and the contents of Group 1. Lines that do not match are printed unchanged.
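As a quick check of the pattern described above, here is a minimal sketch that runs the same command on a small made-up sample. The temporary file, the >scaffold_17 header, and the sequence lines are assumptions for illustration and are not from the question; the point is only to see that chromosome headers are shortened while other headers and sequence lines pass through.

```shell
# Hypothetical sample: two chromosome headers, one unrelated header, sequence lines.
tmp=$(mktemp)
printf '%s\n' \
  '>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence' \
  'ACGTACGT' \
  '>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence' \
  'GGCCTTAA' \
  '>scaffold_17 unplaced' \
  'TTTTAAAA' > "$tmp"

# Only headers matching "chromosome <alnum>," are rewritten; all other lines are kept.
result=$(sed -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With this sample the two chromosome headers should come out as >1 and >X, and the scaffold header and sequences should be untouched.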

If you need to modify the file in place, use

sed -i -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' file                   # GNU sed
sed -i '' -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' file                # FreeBSD sed
sed 's/^>.*chromosome \([[:alnum:]]*\),.*$/>\1/' file > tmp && mv tmp file  # any sed, POSIX BRE syntax

Upvotes: 0

RavinderSingh13

Reputation: 133458

Could you please try the following, written and tested with the shown samples in GNU awk. Header lines without a chromosome and all sequence lines are printed unchanged.

awk '
/^>/ && match($0,/chromosome [^,]*/){
  print substr($0,1,1) substr($0,RSTART+11,RLENGTH-11)
  next
}
1
'  Input_file

Explanation: a detailed explanation of the above.

awk '                              ##Start the awk program here.
/^>/ && match($0,/chromosome [^,]*/){
                                   ##On header lines, match "chromosome " followed by everything up to the comma.
  print substr($0,1,1) substr($0,RSTART+11,RLENGTH-11)
                                   ##Print the 1st character (>) and then the match without its first 11 characters ("chromosome ").
  next                             ##Skip the rest of the program for this line.
}
1                                  ##Print every other line (other headers, sequences) unchanged.
' Input_file                       ##Mention Input_file here.

Once you are happy with the results shown by the above command, try the following command to save the output into Input_file itself.

awk '
/^>/ && match($0,/chromosome [^,]*/){
  print substr($0,1,1) substr($0,RSTART+11,RLENGTH-11)
  next
}
1
'  Input_file > temp && mv temp Input_file
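To see where the offset of 11 comes from, here is a minimal sketch that prints RSTART and RLENGTH next to the extracted name for one header from the question. The shell variable names (line, result) are chosen for illustration only.

```shell
# One header from the question, fed to awk on standard input.
line='>ST078059.1 Ovis aries is a sheep chromosome 10, whole genome shotgun sequence'

result=$(printf '%s\n' "$line" | awk '
match($0, /chromosome [^,]*/) {
  # "chromosome " is 11 characters long, so start 11 characters after RSTART
  # and take 11 fewer characters than RLENGTH.
  print RSTART, RLENGTH, substr($0, RSTART + 11, RLENGTH - 11)
}')
printf '%s\n' "$result"
```

The last field printed is the chromosome name; the first two show the position and length of the whole match, which includes the word chromosome and the following space.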

Upvotes: 3

karakfa

Reputation: 67467

Another sed:

$ sed -E '/chromosome/s/^>.* (.+),.*/>\1/' file

>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>X
>Y

For lines containing chromosome, this captures the token between the last space before the comma and the comma itself, and replaces the whole line with that token, keeping the initial > sign.

Upvotes: 1

kvantour

Reputation: 26471

Assuming that the above are just some of the headers of an actual FASTA file, and the sequences are still in the file, the following solutions will do the job:

$ sed '/^>/{s/,.*//;s/^.* />/}' file.fasta
$ awk '/^>/{sub(/,.*$/,"");$0=">"$NF}1' file.fasta

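Both methods do exactly the same thing. In each line that starts with a >, they remove everything from the first comma to the end of the line, and then replace everything up to the last space with a >. In awk, the latter is done by simply taking the last field. Note that both commands rewrite every header line, not only those containing chromosome.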
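Because the /^>/ address selects every header, a file that also contains other headers may need a narrower address. The sketch below compares the command as given with a variant restricted to headers containing "chromosome ". The variant, the sample file, and the >scaffold_17 header are assumptions for illustration and are not part of this answer.

```shell
# Hypothetical sample with one chromosome header and one unrelated header.
tmp=$(mktemp)
printf '%s\n' \
  '>ST078051.1 Ovis aries is a sheep chromosome 2, whole genome shotgun sequence' \
  'ACGT' \
  '>scaffold_17 unplaced' \
  'TTAA' > "$tmp"

# Address /^>/ : every header is reduced to its last space-separated token.
all_headers=$(sed '/^>/{s/,.*//;s/^.* />/;}' "$tmp")

# Narrower address (an assumption, not from the answer): only chromosome headers.
chrom_only=$(sed '/^>.*chromosome /{s/,.*//;s/^.* />/;}' "$tmp")

printf '%s\n---\n%s\n' "$all_headers" "$chrom_only"
rm -f "$tmp"
```

In the first result the scaffold header is shortened to >unplaced; in the second it is left as it was. The extra ; before } is there so the command is also accepted by sed implementations that require it.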

Upvotes: 2

Cyrus

Reputation: 88583

With GNU sed, regex, and back-reference:

sed -E 's/(.).* ([^ ]+),.*/\1\2/' file

Upvotes: 1
