Reputation: 21
I have a FASTA file assembled from RNA-seq data that looks like this:
>ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.7|OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-201|OR4F5|326
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIV
ITVVSDSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHF
FGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHL
LFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIISYTIILMTIQ
HRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKSLDKFLAVFYSVITPLLNPII
YTLRNKDMKTAIRQLRKWDAHSSVKF
>ENSP00000409316.1|ENST00000426406.4|ENSG00000284733.2|OTTHUMG00000002860.3|OTTHUMT00000007999.3|OR4F29-201|OR4F29|312
MDGENHSVVSEFLFLGLTHSWEIQLLLLVFSSVLYVASITGNILIVFSVTTDPHLHSPMY
FLLASLSFIDLGACSVTSPKMIYDLFRKRKVISFGGCIAQIFFIHVVGGVEMVLLIAMAF
DRYVALCKPLHYLTIMSPRMCLSFLAVAWTLGVSHSLFQLAFLVNLAFCGPNVLDSFYCD
LPRLLRLACTDTYRLQFMVTVNSGFICVGTFFILLISYVFILFTVWKHSSGGSSKALSTL
SAHSTVVLLFFGPPMFVYTRPHPNSQMDKFLAIFDAVLTPFLNPVVYTFRNKEMKAAIKR
VCKQLVIYKRIS
I'd like to keep only the first identifier and the third-to-last field, prefixed with GN=, like this:
>ENSP00000493376.2|GN=OR4F5-201
Ideally I'd also like to add the UniProt ID to this FASTA, but I have no idea how; I'm not very familiar with Linux, Python, etc.
I tried "get-content" from the command line but, as expected, was not able to open the FASTA file.
Upvotes: 2
Views: 27
Reputation: 54552
One way, using awk:
awk -F'|' '{ print /^>/ ? $1 FS "GN=" $(NF-2) : $0 }' file.fa
Or using perl:
perl -F'\|' -le 'print /^>/ ? "$F[0]|GN=$F[-3]" : $_' file.fa
Both solutions essentially do the same thing: they process the file line by line, splitting each line into fields using the pipe character as a delimiter. They use a ternary operator to check if the line starts with a > character. If true, they print a modified header line as required. Otherwise, they print the whole line unchanged. In the Perl solution (Perl 5.20 and later), -F implicitly enables both -a (which splits each line into the @F array) and -n (which loops over the input lines without printing them automatically).
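Since the question mentions Python, the same per-line logic can be sketched there too. This is only an illustration of the approach described above, not a replacement for the one-liners; the function names rewrite_header and rewrite_fasta are made up for this example, and it assumes every header has at least three pipe-separated fields.

```python
def rewrite_header(line):
    """Reduce a FASTA header to '<first field>|GN=<third-from-last field>'.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    fields = line.rstrip("\n").split("|")
    if len(fields) < 3:
        # Header does not have the expected layout; leave it untouched.
        return line
    newline = "\n" if line.endswith("\n") else ""
    return f"{fields[0]}|GN={fields[-3]}{newline}"


def rewrite_fasta(in_path, out_path):
    """Stream in_path to out_path, rewriting header lines only."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rewrite_header(line))
```

Calling rewrite_fasta("file.fa", "out.fa") would then play the role of the awk or Perl command, reading one line at a time so large files do not need to fit in memory.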
If you need to add UniProt IDs to the output, you could try using the UniProt ID mapping service to build a look-up table of IDs. You may also be able to use Ensembl's BioMart for this.
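Once such a look-up table exists, applying it to the headers is a small step. The sketch below assumes the table was saved as a two-column, tab-separated file (Ensembl protein ID, then UniProt accession); that layout, the function names load_mapping and add_uniprot, and the "|UP=" tag are all assumptions made for illustration, not something either service prescribes.

```python
import csv


def load_mapping(tsv_path):
    """Read a two-column TSV: Ensembl protein ID, UniProt accession."""
    mapping = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2 and row[0] and row[1]:
                # Keep the first accession seen for each Ensembl ID.
                mapping.setdefault(row[0], row[1])
    return mapping


def add_uniprot(header, mapping):
    """Append '|UP=<accession>' if the header's first field is in the mapping."""
    header = header.rstrip("\n")
    ensembl_id = header[1:].split("|", 1)[0]
    accession = mapping.get(ensembl_id)
    if accession is None:
        # Tables often hold unversioned IDs, so retry without the '.N' suffix.
        accession = mapping.get(ensembl_id.split(".", 1)[0])
    return header if accession is None else f"{header}|UP={accession}"
```

Headers without a match are returned unchanged, so it is easy to check afterwards how many sequences could not be mapped.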
Upvotes: 2
Reputation: 12425
Use this Perl one-liner:
perl -lpe 'if ( m{^>} ) { @f = split m{\|}, $_; $_ = "$f[0]|GN=$f[-3]"; }' infile > outfile
Output:
>ENSP00000493376.2|GN=OR4F5-201
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIV
ITVVSDSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHF
FGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHL
LFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIISYTIILMTIQ
HRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKSLDKFLAVFYSVITPLLNPII
YTLRNKDMKTAIRQLRKWDAHSSVKF
>ENSP00000409316.1|GN=OR4F29-201
MDGENHSVVSEFLFLGLTHSWEIQLLLLVFSSVLYVASITGNILIVFSVTTDPHLHSPMY
FLLASLSFIDLGACSVTSPKMIYDLFRKRKVISFGGCIAQIFFIHVVGGVEMVLLIAMAF
DRYVALCKPLHYLTIMSPRMCLSFLAVAWTLGVSHSLFQLAFLVNLAFCGPNVLDSFYCD
LPRLLRLACTDTYRLQFMVTVNSGFICVGTFFILLISYVFILFTVWKHSSGGSSKALSTL
SAHSTVVLLFFGPPMFVYTRPHPNSQMDKFLAIFDAVLTPFLNPVVYTFRNKEMKAAIKR
VCKQLVIYKRIS
The Perl one-liner uses these command line flags:
-e: Tells Perl to look for code in-line, instead of in a file.
-p: Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l: Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
if ( m{^>} ) { ... }: If the line starts with > (= if the line is a fasta header line), execute the code in ... .
@f = split m{\|}, $_;: Split the input line on literal | (which needs to be escaped like so: \|), and store the parts in the array @f.
$f[0]: The first element of the array @f.
$f[-3]: The third-to-last element of the array @f ($f[-1] is the last element; a negative index counts from the end of the array).
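To double-check which index holds which field, it can help to list the fields of one header with both their forward and backward positions. Here is a small Python sketch of that check (Python's negative list indexes count from the end in the same way); the helper name index_table is made up for this example.

```python
header = (">ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.7|"
          "OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-201|OR4F5|326")
fields = header.split("|")


def index_table(fields):
    """Return (index from start, index from end, value) for every field."""
    n = len(fields)
    return [(i, i - n, value) for i, value in enumerate(fields)]


# Print one row per field so the wanted column is easy to spot.
for forward, backward, value in index_table(fields):
    print(f"{forward:>2} {backward:>3}  {value}")
```

For this header the field OR4F5-201 sits at index 5 from the start and -3 from the end, which is why the one-liner uses $f[-3].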
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlrequick: Perl regular expressions quick start
Upvotes: 2