Reputation: 83
I have a reference genome containing the following headers (lines starting with >) that I would like to be renamed to simply the digit/letter of the chromosomes. I would like a sed statement to do this systematic replacement, but I am new to sed. Elsewhere in the file are additional headers that should be unchanged, and the genetic sequences between the headers should remain unchanged.
>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
>ST078051.1 Ovis aries is a sheep chromosome 2, whole genome shotgun sequence
>ST078052.1 Ovis aries is a sheep chromosome 3, whole genome shotgun sequence
>ST078053.1 Ovis aries is a sheep chromosome 4, whole genome shotgun sequence
>ST078054.1 Ovis aries is a sheep chromosome 5, whole genome shotgun sequence
>ST078055.1 Ovis aries is a sheep chromosome 6, whole genome shotgun sequence
>ST078056.1 Ovis aries is a sheep chromosome 7, whole genome shotgun sequence
>ST078057.1 Ovis aries is a sheep chromosome 8, whole genome shotgun sequence
>ST078058.1 Ovis aries is a sheep chromosome 9, whole genome shotgun sequence
>ST078059.1 Ovis aries is a sheep chromosome 10, whole genome shotgun sequence
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
>ST078080.1 Ovis aries is a sheep chromosome Y, whole genome shotgun sequence
Output should be:
>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>X
>Y
I tried the following, but it's not right.
sed 's/^.*\(chromosome.*,\).*$/\1/' file
Thank you!
Upvotes: 0
Views: 1099
Reputation: 626758
You can use
sed -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' file > newfile
Details
- -E enables POSIX ERE syntax.
- ^>.*chromosome ([[:alnum:]]+),.*$ matches the start of the string (^), then >, any text (.*), the word chromosome, a space, then captures one or more alphanumeric characters into Group 1, then matches a comma and the rest of the string.
- >\1 replaces the match (here, the whole line) with > and the contents of Group 1.
If you need to modify the file in place, use
sed -i -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' file # GNU sed
sed -i '' -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' file # FreeBSD sed
sed 's/^>.*chromosome \([[:alnum:]]*\),.*$/>\1/' file > tmp && mv tmp file # any sed, POSIX BRE syntax
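As a quick check that lines other than the chromosome headers pass through untouched, the first command can be run on a small made-up FASTA sample. The file name sample.fa, the unplaced-scaffold header, and the sequence lines are invented for illustration and do not come from the question.

```shell
# Build a tiny, made-up FASTA file (contents are illustrative only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fa" <<'EOF'
>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
ACGTACGT
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
TTGACCA
>XX000001.1 Ovis aries unplaced scaffold, whole genome shotgun sequence
GGCCTTAA
EOF

# Same substitution as above; only headers containing "chromosome <id>," match.
result=$(sed -E 's/^>.*chromosome ([[:alnum:]]+),.*$/>\1/' "$tmpdir/sample.fa")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

The two chromosome headers should come out as >1 and >X, while the scaffold header and the sequence lines are printed as they were.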
Upvotes: 0
Reputation: 133458
Please try the following, written and tested with the shown samples in GNU awk. It rewrites the chromosome headers and prints every other line unchanged.
awk '
/^>/ && match($0,/chromosome [^,]*/){
  print substr($0,1,1) substr($0,RSTART+11,RLENGTH-11)
  next
}
1
' Input_file
Explanation: a commented version of the above.
awk '                                   ## Start the awk program.
/^>/ && match($0,/chromosome [^,]*/){   ## On header lines, match "chromosome " and everything up to the comma.
  print substr($0,1,1) substr($0,RSTART+11,RLENGTH-11)
                                        ## Print the first character (>) followed by the match with the 11-character "chromosome " prefix removed.
  next                                  ## Skip to the next line so the original header is not printed as well.
}
1                                       ## Print all other lines (sequences, other headers) unchanged.
' Input_file                            ## Name of the input file.
Once you are happy with the results of the above command, use the following to save the output back into Input_file.
awk '
/^>/ && match($0,/chromosome [^,]*/){
  print substr($0,1,1) substr($0,RSTART+11,RLENGTH-11)
  next
}
1
' Input_file > temp && mv temp Input_file
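The +11 and -11 offsets come from the length of the literal text chromosome plus the space that follows it. A small sketch, using one header from the question, that prints what match() found; the shell variable names header and chrom exist only for this illustration.

```shell
header='>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence'

# Show the matched text and the length of the "chromosome " prefix that gets skipped.
printf '%s\n' "$header" | awk '
match($0, /chromosome [^,]*/) {
  print "matched text: " substr($0, RSTART, RLENGTH)
  print "prefix length: " length("chromosome ")
}'

# Skipping the prefix leaves only the chromosome identifier.
chrom=$(printf '%s\n' "$header" | awk '
match($0, /chromosome [^,]*/) { print substr($0, RSTART + 11, RLENGTH - 11) }')
printf '%s\n' "$chrom"
```

Because the prefix length is fixed, the same offsets work for multi-character identifiers such as 10.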
Upvotes: 3
Reputation: 67467
Another sed approach:
$ sed -E '/chromosome/s/^>.* (.+),.*/>\1/' file
>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>X
>Y
For lines containing chromosome, find the characters just before the comma and replace the record with that token, keeping the initial > sign.
Upvotes: 1
Reputation: 26471
Assuming that the above are just some headers of actual fasta files, and the remaining sequence is still in the files, then the following solutions will do the job:
$ sed '/^>/{s/,.*//;s/^.* />/}' file.fasta
$ awk '/^>/{sub(/,.*$/,"");$0=">"$NF}1' file.fasta
Both methods do the same thing. On each line that starts with >, remove everything from the first , to the end of the line, then replace everything up to the last space with >. In awk, the latter is done by simply taking the last field.
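Note that both commands rewrite every header line. If some headers must stay as they are, one possible variation (a sketch, not part of the original answer) is to restrict the address to headers that contain chromosome followed by a space. The sample file contents, including the unplaced-scaffold header, are made up for illustration.

```shell
# Build a tiny, made-up FASTA file (contents are illustrative only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/file.fasta" <<'EOF'
>ST078059.1 Ovis aries is a sheep chromosome 10, whole genome shotgun sequence
ACGT
>XX000001.1 Ovis aries unplaced scaffold, whole genome shotgun sequence
TTAA
EOF

# Only headers matching /^>.*chromosome / are edited; all other lines are printed as-is.
sed_out=$(sed '/^>.*chromosome /{s/,.*//;s/^.* />/;}' "$tmpdir/file.fasta")
awk_out=$(awk '/^>.*chromosome /{sub(/,.*$/,"");$0=">"$NF}1' "$tmpdir/file.fasta")

printf '%s\n' "$sed_out"
rm -r "$tmpdir"
```

The ; before the closing } is there because some non-GNU sed implementations require it when the command group is written on one line.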
Upvotes: 2
Reputation: 88583
With GNU sed, regex, and back-reference:
sed -E 's/(.).* ([^ ]+),.*/\1\2/' file
Upvotes: 1