Reputation: 30747
I have a text file that I am reading in python . I'm trying to extract certain elements from the text file that follow keywords to append them into empty lists . The file looks like this:
so I want to make two empty lists
1st list will append the sequence names
2nd list will be a list of lists which will include be in the format [Bacteria,Phylum,Class,Order, Family, Genus, Species]
most of the organisms will be Uncultured bacterium . I am trying to add the Uncultured bacterium with the following IDs that are separated by ;
Is there anyway to scan for a certain word and when the word is found, take the word that is after it [separated by a '\t'] ?
I need it to create a dictionary of the Sequence Name to be translated to the taxonomic data .
I know i will need an empty list to append the names to:
seq_names=[ ]
a second list to put the taxonomy lists into
taxonomy=[ ]
and a 3rd list that will be reset after every iteration
temp = [ ]
I'm sure it can be done in Biopython but i'm working on my python skills
Upvotes: 3
Views: 3304
Reputation: 471
Yes there is a way.
You can split the string which you get from reading the file into an array using the inbuilt function split. From this you can find the index of the word you are looking for and then using this index plus one to get the word after it. For example using a text file called test.text that looks like so (the formatting is a bit weird because SO doesn't seem to like hard tabs).
one two three four five six seven eight nine
The following code
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
ind = words.index('seven')
desired = words[ind+1]
will return desired as 'eight'
Edit: To return every following word in the list
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
This is using list comprehensions. It enumerates the list of words and if the word is what you are looking for includes the word at the next index in the list.
Edit2: To split it on both new lines and tabs you can use regular expressions
import re
f = open('testtest.txt','r')
string = f.read()
words = re.split('\t|\n',string)
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
Upvotes: 2
Reputation: 2629
It sounds like you might want a dictionary indexed by sequence name. For instance,
my_data = {
'some_sequence': [Bacteria,Phylum,Class,Order, Family, Genus, Species],
'some_other_sequence': [Bacteria,Phylum,Class,Order, Family, Genus, Species]
}
Then, you'd just access my_data['some_sequence']
to pull up the data about that sequence.
To populate your data structure, I would just loop over the lines of the files, .split('\t')
to break them into "columns" and then do something like my_data[the_row[0]] = [the_row[10], the_row[11], the_row[13]...]
to load the row into the dictionary.
So,
for row in inp_file.readlines():
row = row.split('\t')
my_data[row[0]] = [row[10], row[11], row[13], ...]
Upvotes: 1