Reputation: 7255
I am not sure the question title is appropriate, but I have tried to explain the problem below. Please suggest a better title if you can think of one.
Say I have two types of list data:
list_headers = ['gene_id', 'gene_name', 'trans_id']
# these are the features to be mined from each line of `attri_values`
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']]
I am trying to make a table based on matches between the headers in list_headers and the attributes in attri_values.
output = open('gtf_table', 'w')
output.write('\t'.join(list_headers) + '\n') # this will first write the header
# then I want to read each line
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))
        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))
    output.write('\n')
Problem: when a string from list_headers is found in the values of attri_values, all works well, but when there is no match, 'NA' is written many times over.
Final expected outcome:
gene_id gene_name trans_id
scaffold_200001.1 NA NA
scaffold_200001.1 NA scaffold_200001.1
scaffold_200002.1 NA scaffold_200002.1
Post edit:
This is a problem with how I have written my elif (it writes 'NA' for every non-match). I tried moving the 'NA' condition around, but with no success. If I remove the elif, I get this output (the NA is lost):
gene_id gene_name trans_id
scaffold_200001.1
scaffold_200001.1 scaffold_200001.1
scaffold_200002.1 scaffold_200002.1
Upvotes: 0
Views: 141
Reputation: 4633
I wrote a function that should help you parse your data. I tried to modify the original code you posted; what complicates matters is the way you store the data to be parsed. Anyway, I'm not in a position to judge, so here's my code:
def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable in which title is a substring,
    without including title itself. For each string that does not contain title,
    'N\A' is added instead.

    Example:
    >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
    >>> searchHeader('spam', seq)
    ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')  # If no match is found, append N\A for this string
    res = ' '.join(res)
    # res = res.replace('"', '')  # You can strip the quotes here, or after you call the function
    res = res.split(' ')
    res = [x for x in res if x != title]  # Remove title string from res
    return res
Regular expressions can be handy in such cases too. Parse your data with this function and then format the results to write a table to a file. This function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and one list comprehension.
Pass each header string individually to the function, like in the following:
for title in list_headers:
    result = searchHeader(title, attri_values)
    # ...format as table...
    # ...write to file...
If possible, consider moving from a simple list to a dictionary for your attri_values; that way you can group your strings with their headers:
attri_values = {'header': ('data1', 'data2',...)}
In my view, this is much better than using lists. Also note that you're shadowing the name list in your code. This is not a good idea, because list is actually the builtin class that creates lists.
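A minimal sketch of the dictionary suggestion above combined with a regular expression, assuming every attribute has the form `key "value"` as in the question's data. The names `ATTR_RE` and `parse_record` are hypothetical and only serve the illustration:

```python
import re

# Assumption: each attribute string looks like  key "value"
ATTR_RE = re.compile(r'^(\S+)\s+"([^"]*)"$')

def parse_record(record):
    """Turn one list of 'key "value"' strings into a {key: value} dict."""
    parsed = {}
    for item in record:
        match = ATTR_RE.match(item.strip())
        if match:  # anything that does not fit the assumed shape is skipped
            key, value = match.groups()
            parsed[key] = value
    return parsed

record = ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"']
parsed = parse_record(record)
print(parsed.get('gene_id', 'NA'))    # scaffold_200001.1
print(parsed.get('gene_name', 'NA'))  # NA
```

With one dictionary per record, a missing header becomes a single `.get(header, 'NA')` lookup, so 'NA' shows up exactly once per missing column.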
Upvotes: 1
Reputation: 3504
My answer using pandas
import pandas as pd
# input data
list_headers = ['gene_id', 'gene_name', 'trans_id']
attri_values = [
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']]
# process input data
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values]
# Create DataFrame with the desired columns
df = pd.DataFrame(attri_values_X, columns=list_headers)
# print dataframe
print(df)
Output
gene_id gene_name trans_id
0 "scaffold_200001.1" NaN NaN
1 "scaffold_200001.1" NaN "scaffold_200001.1"
2 "scaffold_200002.1" NaN "scaffold_200002.1"
Doing this without pandas is easy as well. I already gave you attri_values_X, so you are almost there: just remove the keys you do not want from each dictionary.
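For the pandas-free route, here is a rough sketch. It builds the per-record dictionaries the same way as attri_values_X above (on shortened sample data), strips the double quotes, and fills missing headers with 'NA'. The helper name `build_rows` is hypothetical:

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']
# Shortened sample data, same shape as in the question
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"'],
]

def build_rows(headers, records, missing='NA'):
    """Return one list per record with the value of each header, or `missing`."""
    rows = []
    for record in records:
        # key/value pairs taken from each 'key "value"' string
        as_dict = dict(tuple(item.split())[:2] for item in record)
        rows.append([as_dict.get(h, missing).replace('"', '') for h in headers])
    return rows

rows = build_rows(list_headers, attri_values)

# Assemble a tab-separated table, header line first
lines = ['\t'.join(list_headers)] + ['\t'.join(row) for row in rows]
print('\n'.join(lines))
```

The same `lines` could be written to a file such as the question's gtf_table instead of being printed.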
Upvotes: 1
Reputation: 46
Python strings have a find method, which you can use to check each list header against each entry of attri_values. Try using this function:
def Get_Match(search_space, search_string):
    start_character = search_space.find(search_string)
    if start_character == -1:
        return "N/A"
    else:
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is assumed here to be a flat list of attribute strings
for i in range(len(attri_values_1)):
    for j in range(len(list_headers)):
        print(Get_Match(attri_values_1[i], list_headers[j]))
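A small usage sketch, assuming each search space is a single attribute string such as `gene_id "scaffold_200001.1"` (the sample string comes from the question). The returned slice still carries the leading space and the double quotes, so it is cleaned up with `strip` here:

```python
def Get_Match(search_space, search_string):
    start_character = search_space.find(search_string)
    if start_character == -1:
        return "N/A"
    return search_space[(start_character + len(search_string)):]

attribute = 'gene_id "scaffold_200001.1"'

raw = Get_Match(attribute, 'gene_id')
print(repr(raw))                         # ' "scaffold_200001.1"'
print(raw.strip(' "'))                   # scaffold_200001.1
print(Get_Match(attribute, 'trans_id'))  # N/A
```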
Upvotes: 1