Reputation: 7
Okay, so I'm pretty new to this sort of thing, so please bear with me.
I have two files:
search_results_accesions.txt
is a list of identifiers, one per line. It looks like this (note that not all of the identifiers will start with "NP_"):
$ more search_results_accessions.txt
NP_000020.1
NP_000026.2
NP_000027.2
NP_000029.2
NP_000034.1
NP_000042.3
NP_000056.2
NP_000063.2
NP_000065.1
NP_000068.1
NP_000088.3
NP_000112.1
NP_000117.1
NP_000147.1
NP_000156.1
NP_000167.1
NP_000205.1
NP_000228.1
NP_000241.1
NP_000305.3
NP_000347.2
NP_000354.4
NP_000370.2
prot.accession2taxid.txt
is a file which lists each of the identifiers (and many, many more which are not in my list), and gives the corresponding taxid
. Here is what that looks like (the third column contains the taxid
s):
$ more prot.accession2taxid
accession accession.version taxid gi
APZ74649 APZ74649.1 36984 1137646701
AQT41667 AQT41667.1 1686310 1150388099
WP_080502060 WP_080502060.1 95486 1169627919
ASF53620 ASF53620.1 492670 1211447116
ASF53621 ASF53621.1 492670 1211447117
ASF53622 ASF53622.1 492670 1211447118
ASF53623 ASF53623.1 492670 1211447119
ASF53624 ASF53624.1 492670 1211447120
ASF53625 ASF53625.1 492670 1211447121
ASF53626 ASF53626.1 492670 1211447122
ASF53627 ASF53627.1 492670 1211447123
ASF53628 ASF53628.1 492670 1211447124
ASF53629 ASF53629.1 492670 1211447125
ASF53630 ASF53630.1 492670 1211447126
ASF53631 ASF53631.1 492670 1211447127
ASF53632 ASF53632.1 492670 1211447128
ASF53633 ASF53633.1 492670 1211447129
APZ74650 APZ74650.1 36984 1137646703
APZ74651 APZ74651.1 36984 1137646705
APZ74652 APZ74652.1 36984 1137646707
APZ74653 APZ74653.1 36984 1137646709
APZ74654 APZ74654.1 36984 1137646711
Fields are tab-separated.
I need to get the taxid
for each accession
in my searchresults_accessions.txt
file. I'm on a Unix system and would prefer to use command line or Python, if at all possible.
Upvotes: 0
Views: 90
Reputation: 18707
Here's a solution with awk
(you did say command line or Python):
awk 'NR==FNR {ids[$1]=1} NR>FNR && ($1 in ids) {print $1 "\t" $3}' accessions taxids
Explanation:
accessions
file, then taxids
NR==FNR
), we add values from the first column to the associative map ids
Upvotes: 0
Reputation: 15443
Here is a solution using python and the pandas
module.
I made a few modifications to your files for this to work (added a column name at the top of the first file and replaced multiple tabs by single tabs in the second file). Assuming that you have the following files file1.txt
:
accession.version
NP_000020.1
NP_000026.2
NP_000027.2
NP_000029.2
NP_000034.1
NP_000042.3
NP_000056.2
NP_000063.2
NP_000065.1
NP_000068.1
NP_000088.3
NP_000112.1
NP_000117.1
NP_000147.1
NP_000156.1
NP_000167.1
NP_000205.1
NP_000228.1
NP_000241.1
NP_000305.3
NP_000347.2
NP_000354.4
NP_000370.2
and file2.txt
:
accession accession.version taxid gi
APZ74649 APZ74649.1 36984 1137646701
AQT41667 AQT41667.1 1686310 1150388099
WP_080502060 WP_080502060.1 95486 1169627919
ASF53620 ASF53620.1 492670 1211447116
ASF53621 ASF53621.1 492670 1211447117
ASF53622 ASF53622.1 492670 1211447118
ASF53623 ASF53623.1 492670 1211447119
ASF53624 ASF53624.1 492670 1211447120
ASF53625 ASF53625.1 492670 1211447121
ASF53626 ASF53626.1 492670 1211447122
ASF53627 ASF53627.1 492670 1211447123
ASF53628 ASF53628.1 492670 1211447124
NP_000088 NP_000088.3 62163 3543665822
ASF53629 ASF53629.1 492670 1211447125
ASF53630 ASF53630.1 492670 1211447126
ASF53631 ASF53631.1 492670 1211447127
ASF53632 ASF53632.1 492670 1211447128
ASF53633 ASF53633.1 492670 1211447129
APZ74650 APZ74650.1 36984 1137646703
APZ74651 APZ74651.1 36984 1137646705
APZ74652 APZ74652.1 36984 1137646707
APZ74653 APZ74653.1 36984 1137646709
APZ74654 APZ74654.1 36984 1137646711
NP_000117 NP_000117.1 65683 3543634522
You can do the following:
import pandas as pd
df1 = pd.read_csv('file1.txt', delimiter='\t')
df2 = pd.read_csv('file2.txt', delimiter='\t')
df = df1.merge(df2)
# accession.version accession taxid gi
# 0 NP_000088.3 NP_000088 62163 3543665822
# 1 NP_000117.1 NP_000117 65683 3543634522
If you are only interested in the taxid:
taxid = df.taxid
# 0 62163
# 1 65683
# Name: taxid, dtype: int64
Upvotes: 1