Ulrike Resch

Reputation: 21

editing fasta-lines, keeping first (ENSP) and last (Gene-Symbol-Isoform) and adding Uniprot ID

I have a fasta file, assembled from RNA-seq data, that looks like this:

>ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.7|OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-201|OR4F5|326
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIV
ITVVSDSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHF
FGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHL
LFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIISYTIILMTIQ
HRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKSLDKFLAVFYSVITPLLNPII
YTLRNKDMKTAIRQLRKWDAHSSVKF
>ENSP00000409316.1|ENST00000426406.4|ENSG00000284733.2|OTTHUMG00000002860.3|OTTHUMT00000007999.3|OR4F29-201|OR4F29|312
MDGENHSVVSEFLFLGLTHSWEIQLLLLVFSSVLYVASITGNILIVFSVTTDPHLHSPMY
FLLASLSFIDLGACSVTSPKMIYDLFRKRKVISFGGCIAQIFFIHVVGGVEMVLLIAMAF
DRYVALCKPLHYLTIMSPRMCLSFLAVAWTLGVSHSLFQLAFLVNLAFCGPNVLDSFYCD
LPRLLRLACTDTYRLQFMVTVNSGFICVGTFFILLISYVFILFTVWKHSSGGSSKALSTL
SAHSTVVLLFFGPPMFVYTRPHPNSQMDKFLAIFDAVLTPFLNPVVYTFRNKEMKAAIKR
VCKQLVIYKRIS

I'd like to keep only the first identifier and the third-from-last one (the transcript name, e.g. OR4F5-201), prefixed with GN=, like this:

>ENSP00000493376.2|GN=OR4F5-201

Ideally I'd also like to add the UniProt ID to each header, but I have no idea how; I'm not very familiar with Linux, Python, etc.

I tried "get-content" from the command line but, as expected, was not able to open the fasta file.

Upvotes: 2

Views: 27

Answers (2)

Steve

Reputation: 54552

One way, using awk:

awk -F'|' '{ print /^>/ ? $1 FS "GN=" $(NF-2) : $0 }' file.fa

Or using Perl:

perl -F'\|' -le 'print /^>/ ? "$F[0]|GN=$F[-3]" : $_' file.fa

Both solutions do essentially the same thing: they read the file line by line and split each line into fields on the pipe character. A ternary operator checks whether the line starts with >. If it does, a shortened header is printed, built from the first field and the third-from-last field; otherwise the line is printed unchanged. In the Perl solution, -F implicitly enables both -a (which splits each line into the @F array) and -n (which loops over the input lines without printing them automatically). This implicit behaviour requires Perl 5.20 or later; on older versions, give the flags explicitly (-F'\|' -lane).
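For readers without awk or Perl at hand (for instance on Windows), the same header logic can be sketched in Python. This is only an illustration, not part of either one-liner: the function name shorten_header is made up here, and it assumes every header has at least three pipe-separated fields, as in the question's example.

```python
def shorten_header(line):
    """Return '>FIRST|GN=THIRD_FROM_LAST' for header lines; other lines unchanged."""
    line = line.rstrip("\n")
    if not line.startswith(">"):
        return line                      # sequence line: pass through as-is
    fields = line.split("|")             # split the header on literal '|'
    return f"{fields[0]}|GN={fields[-3]}"


# Two lines taken from the question's example file.
example = [
    ">ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.7|"
    "OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-201|OR4F5|326",
    "MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIV",
]
for line in example:
    print(shorten_header(line))
```

To process a whole file, the same function could be applied to each line of an opened file and the results written to a new file.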

If you need to add UniProt IDs to the output, you could use the UniProt ID mapping service to build a look-up table from Ensembl protein IDs to UniProt accessions. Ensembl's BioMart may also be able to provide this.
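Once such a look-up table exists, applying it to the shortened headers could look roughly like the Python sketch below. Everything specific in it is an assumption made for illustration: the table is taken to be two tab-separated columns (Ensembl protein ID, UniProt accession), the UP= tag and the function names are invented here, and ACC_PLACEHOLDER stands in for a real accession. The sketch tries the versioned ID first and then the ID without its version suffix, since the table may list either form.

```python
import csv
import io


def load_mapping(handle):
    """Read a two-column, tab-separated table: Ensembl protein ID, UniProt accession."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            # Keep the first accession seen for each Ensembl protein ID.
            mapping.setdefault(row[0], row[1])
    return mapping


def add_uniprot(header, mapping):
    """Append '|UP=<accession>' (hypothetical tag) when the ENSP ID is in the table."""
    ensp = header[1:].split("|", 1)[0]     # e.g. ENSP00000493376.2
    unversioned = ensp.split(".", 1)[0]    # e.g. ENSP00000493376
    accession = mapping.get(ensp) or mapping.get(unversioned)
    return f"{header}|UP={accession}" if accession else header


# Placeholder table; a real one would be read from the downloaded mapping file.
table = io.StringIO("ENSP00000493376\tACC_PLACEHOLDER\n")
mapping = load_mapping(table)
print(add_uniprot(">ENSP00000493376.2|GN=OR4F5-201", mapping))
print(add_uniprot(">ENSP00000409316.1|GN=OR4F29-201", mapping))
```

Headers without a match are left unchanged, so sequences that have no UniProt entry are not lost.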

Upvotes: 2

Timur Shtatland

Reputation: 12425

Use this Perl one-liner:

perl -lpe 'if ( m{^>} ) { @f = split m{\|}, $_; $_ = "$f[0]|GN=$f[-3]"; }' infile > outfile

Output:

>ENSP00000493376.2|GN=OR4F5-201
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIV
ITVVSDSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHF
FGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHL
LFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIISYTIILMTIQ
HRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKSLDKFLAVFYSVITPLLNPII
YTLRNKDMKTAIRQLRKWDAHSSVKF
>ENSP00000409316.1|GN=OR4F29-201
MDGENHSVVSEFLFLGLTHSWEIQLLLLVFSSVLYVASITGNILIVFSVTTDPHLHSPMY
FLLASLSFIDLGACSVTSPKMIYDLFRKRKVISFGGCIAQIFFIHVVGGVEMVLLIAMAF
DRYVALCKPLHYLTIMSPRMCLSFLAVAWTLGVSHSLFQLAFLVNLAFCGPNVLDSFYCD
LPRLLRLACTDTYRLQFMVTVNSGFICVGTFFILLISYVFILFTVWKHSSGGSSKALSTL
SAHSTVVLLFFGPPMFVYTRPHPNSQMDKFLAIFDAVLTPFLNPVVYTFRNKEMKAAIKR
VCKQLVIYKRIS

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

if ( m{^>} ) { ... } : If the line starts with > (= if the line is a fasta header line), execute the code in ....
@f = split m{\|}, $_; : Split the input line on literal | (which needs to be escaped like so: \|), and store the parts in the array @f.
$f[0] : The first element of the array @f.
$f[-3] : The third element from the end of the array @f (negative indexes count from the end, so $f[-1] is the last element and $f[-2] is the second to last).
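Python lists use the same negative-index convention as Perl arrays, so a quick check against the first header from the question shows which field each index selects. This is an illustrative sketch only, not part of the one-liner.

```python
header = (
    ">ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.7|"
    "OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-201|OR4F5|326"
)
fields = header.split("|")   # same split as: split m{\|}, $_

print(len(fields))           # number of pipe-separated fields
for offset in (-1, -2, -3):
    # Negative offsets count back from the end of the list.
    print(offset, fields[offset])
```

The header has eight fields, and the transcript name OR4F5-201 is the one at index -3.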

Upvotes: 2
