O.rka

Reputation: 30747

Scan through txt, append certain data to an empty list in Python

I have a text file that I am reading in Python. I'm trying to extract certain elements that follow keywords and append them to empty lists. The file looks like this:

[image: screenshot of the text file, not preserved]

So I want to make two empty lists.

Is there any way to scan for a certain word and, when the word is found, take the word that comes after it (separated by a '\t')?

I need this to build a dictionary that maps each sequence name to its taxonomic data.

I know I will need an empty list to append the names to:

seq_names = []

A second list to put the taxonomy lists into:

taxonomy = []

And a third list that will be reset after every iteration:

temp = []

I'm sure it can be done in Biopython, but I'm working on my Python skills.

Upvotes: 3

Views: 3304

Answers (2)

MDT

Reputation: 471

Yes, there is a way.

You can split the string you get from reading the file into a list using the built-in str.split method. From this you can find the index of the word you are looking for, then use that index plus one to get the word after it. For example, take a text file called test.txt that looks like this (the formatting is a bit weird because SO doesn't seem to like hard tabs):

one two three   four    five    six seven   eight   nine

The following code

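# Read the whole file; the with block closes it afterwards
with open('test.txt', 'r') as f:
    string = f.read()

words = string.split('\t')   # tab-separated fields
ind = words.index('seven')   # raises ValueError if 'seven' is absent
desired = words[ind + 1]     # the field right after the keyword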

will return desired as 'eight'
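
The lookup above fails in two cases: list.index raises ValueError when the keyword is missing, and ind + 1 runs off the end when the keyword is the last field. A minimal sketch that guards against both, assuming the text is already in a string (the function name word_after and the sample string are made up for illustration):

```python
def word_after(text, keyword, sep='\t'):
    """Return the field following the first occurrence of keyword, or None."""
    words = text.split(sep)
    try:
        ind = words.index(keyword)
    except ValueError:
        # keyword does not occur in the text
        return None
    # keyword may be the last field, in which case nothing follows it
    return words[ind + 1] if ind + 1 < len(words) else None


# Invented sample data mirroring the tab-separated example
sample = 'one\ttwo\tthree\tfour\tfive\tsix\tseven\teight\tnine'
print(word_after(sample, 'seven'))  # eight
print(word_after(sample, 'ten'))    # None
```

Returning None instead of raising is only one possible choice; raising may be preferable if a missing keyword means the file is malformed.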

Edit: To return every following word in the list

with open('test.txt', 'r') as f:
    string = f.read()
words = string.split('\t')

# Stop one short of the end so words[ind + 1] always exists
desired = [words[ind + 1] for ind, word in enumerate(words[:-1]) if word == "seven"]

This uses a list comprehension. It enumerates the list of words and, whenever a word matches the one you are looking for, includes the word at the next index.

Edit2: To split it on both new lines and tabs you can use regular expressions

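import re

with open('testtest.txt', 'r') as f:
    string = f.read()

# Split on either a tab or a newline
words = re.split(r'\t|\n', string)

# Stop one short of the end so words[ind + 1] always exists
desired = [words[ind + 1] for ind, word in enumerate(words[:-1]) if word == "seven"]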
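
To see why the newline matters, here is a small sketch on an in-memory string (the sample text is invented). Splitting on tabs alone leaves the last field of one line glued to the first field of the next, while the regular expression separates them:

```python
import re

# Invented sample: three lines, fields separated by tabs
sample = 'one\ttwo\tthree\nfour\tseven\teight\nseven\tnine\n'

tab_only = sample.split('\t')
both = re.split(r'\t|\n', sample)

print(tab_only)  # contains 'three\nfour' and 'eight\nseven' as single fields
print(both)      # every field is separate; the trailing newline leaves a final ''

# Stop one short of the end so both[i + 1] always exists
desired = [both[i + 1] for i, word in enumerate(both[:-1]) if word == 'seven']
print(desired)   # ['eight', 'nine']
```

Note that with tab-only splitting the second 'seven' would never match, because it is stored as 'eight\nseven'.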

Upvotes: 2

gfortune

Reputation: 2629

It sounds like you might want a dictionary indexed by sequence name. For instance,

my_data = {
    'some_sequence': ['Bacteria', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'],
    'some_other_sequence': ['Bacteria', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'],
}

Then, you'd just access my_data['some_sequence'] to pull up the data about that sequence.

To populate your data structure, I would just loop over the lines of the file, call .split('\t') on each one to break it into "columns", and then do something like my_data[the_row[0]] = [the_row[10], the_row[11], the_row[13]...] to load the row into the dictionary.

So,

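my_data = {}
for row in inp_file:  # inp_file is an already-open file object
    row = row.rstrip('\n').split('\t')
    # "..." stands for the remaining columns you want to keep
    my_data[row[0]] = [row[10], row[11], row[13], ...]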
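
A runnable sketch of the same loop, under loud assumptions: the real column layout is only visible in the question's screenshot, so this uses an invented layout where column 0 is the sequence name and every remaining column is a taxonomic rank. The sample rows and the io.StringIO stand-in for the file are made up for illustration.

```python
import io

# Hypothetical stand-in for the real file: column 0 is the sequence name,
# the remaining columns are taxonomic ranks. The real layout may differ.
inp_file = io.StringIO(
    'seq_1\tBacteria\tFirmicutes\tBacilli\n'
    'seq_2\tBacteria\tProteobacteria\tGammaproteobacteria\n'
    '\n'
)

my_data = {}
for line in inp_file:
    row = line.rstrip('\n').split('\t')
    if len(row) < 2:
        continue  # skip blank or malformed lines
    # Map the sequence name to the list of taxonomy fields
    my_data[row[0]] = row[1:]

print(my_data['seq_1'])  # ['Bacteria', 'Firmicutes', 'Bacilli']
```

With a dictionary built this way, the separate seq_names, taxonomy and temp lists from the question are not needed; list(my_data) gives the names if you want them.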

Upvotes: 1
