everestial

Reputation: 7255

How to find the match between two lists and write the output based on matches?

I am not sure the title describes the problem well, so I have tried to explain it below. Please suggest a better title if you can think of one.

Say I have two types of list data:

list_headers = ['gene_id', 'gene_name', 'trans_id'] 
# these are the features to be mined from each line of `attri_values`

attri_values = 

['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"']
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"']
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']

I am trying to build a table by matching the headers in list_headers against the attributes in attri_values.

output = open('gtf_table', 'w')
output.write('\t'.join(list_headers) + '\n') # this will first write the header

# then I want to read each line
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))

        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))

        output.write('\n')

Problem: when a string from list_headers matches a value in attri_values everything works, but when there is no match, 'NA' is written many times over.

Final expected outcome:

gene_id    gene_name    trans_id
scaffold_200001.1    NA    NA
scaffold_200001.1    NA    scaffold_200001.1
scaffold_200002.1    NA    scaffold_200002.1

Post edit: The problem is in how I have written my elif, because it writes 'NA' for every non-match. I tried moving the 'NA' condition around in different ways, but with no success. If I remove the elif, I get this output (the 'NA' values are lost):

gene_id    gene_name    trans_id
scaffold_200001.1
scaffold_200001.1    scaffold_200001.1
scaffold_200002.1    scaffold_200002.1

Upvotes: 0

Views: 141

Answers (3)

GIZ

Reputation: 4633

I wrote a function that should help you parse your data. I first tried to modify the code you posted, but the way the data is stored complicates matters. Here is my code:

def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable in which title is a substring,
    without including title itself. Append 'N\A' for each string that does not contain title.
    Example:
             >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
             >>> searchHeader('spam', seq)
             ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')                    # No match: append N\A for this string

    res = ' '.join(res)
    # Optional: strip the double quotes here, or after calling the function:
    # res = res.replace('"', '')
    res = res.split(' ')
    res = [x for x in res if x != title]          # Remove the title string from res
    return res

Regular expressions can be handy in such cases too. Parse your data with this function, then format the results and write the table to a file. The function uses a single for loop and one list comprehension, whereas your code uses two nested for loops and a list comprehension.

Pass each header string individually to the function, like in the following:

for title in list_headers: 
    result = searchHeader(title, attri_values)
    ...format as table...
    ...write to file... 

If possible, consider moving from a plain list to a dictionary for your attri_values, so that you can group your strings under their headers:

attri_values = {'header': ('data1', 'data2',...)}

In my view, this is much better than using lists. Also note that your loop variable is named list, which shadows the built-in list class that creates lists. That is best avoided.
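To make the dictionary and regular-expression suggestions concrete, here is a minimal sketch, not a definitive implementation. It assumes every item looks like a key followed by a double-quoted value; the names PAIR, row_to_dict and build_table are made up for illustration, and the rows are shortened versions of the ones in the question.

```python
import re

list_headers = ['gene_id', 'gene_name', 'trans_id']

# Two rows shortened from the question's data.
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"'],
]

# Assumed item shape: a key, whitespace, then a double-quoted value.
PAIR = re.compile(r'(\w+)\s+"([^"]*)"')

def row_to_dict(row):
    """Turn one row of 'key "value"' strings into a {key: value} dict."""
    pairs = {}
    for item in row:
        match = PAIR.match(item)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs

def build_table(rows, headers, missing='NA'):
    """Return one list of cell values per row, in header order."""
    table = []
    for row in rows:
        pairs = row_to_dict(row)
        # dict.get yields exactly one cell per header, so 'NA' is never repeated.
        table.append([pairs.get(header, missing) for header in headers])
    return table

table = build_table(attri_values, list_headers)
for cells in table:
    print('\t'.join(cells))
```

Because each header is looked up once per row, a missing attribute produces a single 'NA' cell instead of one per non-matching item.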

Upvotes: 1

Elmex80s

Reputation: 3504

My answer using pandas

import pandas as pd

# input data
list_headers = ['gene_id', 'gene_name', 'trans_id']

attri_values = [
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']]

# process input data
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values]

# Create DataFrame with the desired columns
df = pd.DataFrame(attri_values_X, columns=list_headers)

# print dataframe
print(df)

Output

               gene_id  gene_name             trans_id
0  "scaffold_200001.1"        NaN                  NaN
1  "scaffold_200001.1"        NaN  "scaffold_200001.1"
2  "scaffold_200002.1"        NaN  "scaffold_200002.1"

Doing this without pandas is easy as well. You already have attri_values_X, so you are almost there: just drop the dictionary keys you do not want.
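One possible way to do the pandas-free version is with the standard library's csv.DictWriter, which can ignore unwanted keys and fill in missing ones. This is a sketch assuming Python 3; records and write_table are illustrative names, the rows are shortened from the question, and the output goes to an in-memory buffer rather than the gtf_table file.

```python
import csv
import io

list_headers = ['gene_id', 'gene_name', 'trans_id']

# Two rows shortened from the question's data.
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"', 'exon_number "3"'],
]

# Same idea as attri_values_X, with the surrounding double quotes stripped.
records = [
    dict((key, value.strip('"')) for key, value in (item.split()[:2] for item in row))
    for row in attri_values
]

def write_table(records, headers, handle):
    """Write the header line and one tab-separated line per record."""
    writer = csv.DictWriter(
        handle,
        fieldnames=headers,
        delimiter='\t',
        restval='NA',            # value written for headers missing from a record
        extrasaction='ignore',   # silently skip keys that are not in headers
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(records)

buffer = io.StringIO()
write_table(records, list_headers, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open('gtf_table', 'w', newline='') as the handle.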

Upvotes: 1

P.Diddy

Reputation: 46

Python strings have a find method, which you can use to check each list header against each entry in attri_values. Try using this function:

def Get_Match(search_space,search_string):
    start_character = search_space.find(search_string)

    if start_character == -1:
        return "N/A"
    else:
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is assumed here to be one row (a list of strings) from attri_values
for i in range(len(attri_values_1)):
    for j in range(len(list_headers)):
        print(Get_Match(attri_values_1[i], list_headers[j]))
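Since find matches a header anywhere inside a string, a header such as 'gene' would also match 'gene_id'. A variant that compares the key exactly can be sketched with str.partition. The helper name get_value is hypothetical, and the sketch assumes a single space separates each key from its quoted value.

```python
def get_value(items, header, missing="N/A"):
    """Return the value of the first item whose key equals header exactly."""
    for item in items:
        # partition splits on the first space only: 'key', ' ', '"value"'
        key, _, value = item.partition(' ')
        if key == header:
            return value.strip('"')
    return missing

list_headers = ['gene_id', 'gene_name', 'trans_id']
row = ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"']

# One value per header, so each row becomes exactly one table line.
line = '\t'.join(get_value(row, header) for header in list_headers)
print(line)
```

Calling get_value once per header, rather than printing once per item and header pair, keeps the number of cells in each line equal to the number of headers.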

Upvotes: 1
