m_breuer
m_breuer

Reputation: 7

Filter table in text file with IDs from second file

Okay, so I'm pretty new to this sort of thing, so please bear with me.

I have two files:

search_results_accesions.txt is a list of identifiers, one per line. It looks like this (note that not all of the identifiers will start with "NP_"):

$ more search_results_accessions.txt 
NP_000020.1
NP_000026.2
NP_000027.2
NP_000029.2
NP_000034.1
NP_000042.3
NP_000056.2
NP_000063.2
NP_000065.1
NP_000068.1
NP_000088.3
NP_000112.1
NP_000117.1
NP_000147.1
NP_000156.1
NP_000167.1
NP_000205.1
NP_000228.1
NP_000241.1
NP_000305.3
NP_000347.2
NP_000354.4
NP_000370.2

prot.accession2taxid.txt is a file which lists each of the identifiers (and many, many more which are not in my list), and gives the corresponding taxid. Here is what that looks like (the third column contains the taxids):

$ more prot.accession2taxid
accession       accession.version       taxid   gi
APZ74649        APZ74649.1      36984   1137646701
AQT41667        AQT41667.1      1686310 1150388099
WP_080502060    WP_080502060.1  95486   1169627919
ASF53620        ASF53620.1      492670  1211447116
ASF53621        ASF53621.1      492670  1211447117
ASF53622        ASF53622.1      492670  1211447118
ASF53623        ASF53623.1      492670  1211447119
ASF53624        ASF53624.1      492670  1211447120
ASF53625        ASF53625.1      492670  1211447121
ASF53626        ASF53626.1      492670  1211447122
ASF53627        ASF53627.1      492670  1211447123
ASF53628        ASF53628.1      492670  1211447124
ASF53629        ASF53629.1      492670  1211447125
ASF53630        ASF53630.1      492670  1211447126
ASF53631        ASF53631.1      492670  1211447127
ASF53632        ASF53632.1      492670  1211447128
ASF53633        ASF53633.1      492670  1211447129
APZ74650        APZ74650.1      36984   1137646703
APZ74651        APZ74651.1      36984   1137646705
APZ74652        APZ74652.1      36984   1137646707
APZ74653        APZ74653.1      36984   1137646709
APZ74654        APZ74654.1      36984   1137646711

Fields are tab-separated.

I need to get the taxid for each accession in my searchresults_accessions.txt file. I'm on a Unix system and would prefer to use command line or Python, if at all possible.

Upvotes: 0

Views: 90

Answers (2)

randomir
randomir

Reputation: 18707

Here's a solution with awk (you did say command line or Python):

awk 'NR==FNR {ids[$1]=1} NR>FNR && ($1 in ids) {print $1 "\t" $3}' accessions taxids

Explanation:

  • we split input lines using the default separators (space or tab)
  • first we read the accessions file, then taxids
  • for the lines in the first file (while total number of records read is equal to the number of records from current file, NR==FNR), we add values from the first column to the associative map ids
  • for the lines in the second file, we print the first and the third field, separated by a tab character, but only if the value in the first field is present in our map of accession ids

Upvotes: 0

Julien Spronck
Julien Spronck

Reputation: 15443

Here is a solution using python and the pandas module.

I made a few modifications to your files for this to work (added a column name at the top of the first file and replaced multiple tabs by single tabs in the second file). Assuming that you have the following files file1.txt:

accession.version
NP_000020.1
NP_000026.2
NP_000027.2
NP_000029.2
NP_000034.1
NP_000042.3
NP_000056.2
NP_000063.2
NP_000065.1
NP_000068.1
NP_000088.3
NP_000112.1
NP_000117.1
NP_000147.1
NP_000156.1
NP_000167.1
NP_000205.1
NP_000228.1
NP_000241.1
NP_000305.3
NP_000347.2
NP_000354.4
NP_000370.2

and file2.txt:

accession   accession.version   taxid   gi
APZ74649    APZ74649.1  36984   1137646701
AQT41667    AQT41667.1  1686310 1150388099
WP_080502060    WP_080502060.1  95486   1169627919
ASF53620    ASF53620.1  492670  1211447116
ASF53621    ASF53621.1  492670  1211447117
ASF53622    ASF53622.1  492670  1211447118
ASF53623    ASF53623.1  492670  1211447119
ASF53624    ASF53624.1  492670  1211447120
ASF53625    ASF53625.1  492670  1211447121
ASF53626    ASF53626.1  492670  1211447122
ASF53627    ASF53627.1  492670  1211447123
ASF53628    ASF53628.1  492670  1211447124
NP_000088   NP_000088.3 62163   3543665822
ASF53629    ASF53629.1  492670  1211447125
ASF53630    ASF53630.1  492670  1211447126
ASF53631    ASF53631.1  492670  1211447127
ASF53632    ASF53632.1  492670  1211447128
ASF53633    ASF53633.1  492670  1211447129
APZ74650    APZ74650.1  36984   1137646703
APZ74651    APZ74651.1  36984   1137646705
APZ74652    APZ74652.1  36984   1137646707
APZ74653    APZ74653.1  36984   1137646709
APZ74654    APZ74654.1  36984   1137646711
NP_000117   NP_000117.1 65683   3543634522

You can do the following:

import pandas as pd
df1 = pd.read_csv('file1.txt', delimiter='\t')
df2 = pd.read_csv('file2.txt', delimiter='\t')
df = df1.merge(df2)

#      accession.version  accession  taxid          gi
# 0       NP_000088.3  NP_000088  62163  3543665822
# 1       NP_000117.1  NP_000117  65683  3543634522

If you are only interested in the taxid:

taxid = df.taxid

# 0    62163
# 1    65683
# Name: taxid, dtype: int64

Upvotes: 1

Related Questions