CiaranWelsh

Reputation: 7681

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:

list1=['xyz','xyz2','other_randoms']
list2=['xyz']

I need to find which elements of list2 are in list1. In fact, each element of list1 corresponds to a numerical value that I need to obtain and then change. The problem is that 'xyz2' contains 'xyz', so a regular expression built from 'xyz' matches it as well.

My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):

    all_keys = list(data.keys())
    for i in range(len(all_keys)):
        if all_keys[i]!='Time':
            #print all_keys[i]
            pattern = re.compile(all_keys[i])
            for j in range(len(specie_name_and_initial_values)):
                print re.findall(pattern,specie_name_and_initial_values[j][0])

Variations of the regular expression I have tried include:

pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')

I've also tried a membership test with 'in' inside a for loop.

Any help would be greatly appreciated. Thanks

Ciaran

----------EDIT------------

To clarify, my current code is below. It is used as a method inside a class.

def calculate_relative_data_based_on_initial_values(self, copasi_file, xlsx_data_file, data_type='fold_change', time='seconds'):
    copasi_tool = MineParamEstTools() 
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0) 
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]

The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), look up each header name in the specie_name_and_initial_values variable and obtain the corresponding value (the second element of each sublist). After that, I multiply every value in the matching column of my data table by the value obtained for that header.

I'm most likely overcomplicating this. Is there a better solution?

thanks

----------edit 2 ---------------

Okay, below are my variables

all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])

species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])

Upvotes: 0

Views: 81

Answers (1)

Padraic Cunningham

Reputation: 180411

You don't need a regex to find common elements. set.intersection will find all elements of list2 that are also in list1:

list1=['xyz','xyz2','other_randoms']
list2=['xyz']

print(set(list2).intersection(list1))
# Output (Python 2): set(['xyz'])

Also, if you wanted to compare 'xyz' to 'xyz2', you would use == rather than in. 'xyz' == 'xyz2' correctly returns False, whereas 'xyz' in 'xyz2' is True because in tests for a substring.
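To make the difference concrete, here is a small sketch using the lists from the question. The variable names substring_hits, exact_hits and regex_hits are only for illustration, and the re.escape call is an addition of mine to guard against regex metacharacters in a key:

```python
import re

list1 = ['xyz', 'xyz2', 'other_randoms']
key = 'xyz'

# Substring test: 'xyz' is contained in 'xyz2', so both names match.
substring_hits = [name for name in list1 if key in name]

# Equality test: only the exact name matches.
exact_hits = [name for name in list1 if key == name]

# Anchored regex: '^' and '$' force a whole-string match, and
# re.escape stops characters such as '[' being read as regex syntax.
pattern = re.compile('^' + re.escape(key) + '$')
regex_hits = [name for name in list1 if pattern.match(name)]

print(substring_hits)  # ['xyz', 'xyz2']
print(exact_hits)      # ['xyz']
print(regex_hits)      # ['xyz']
```

Note that an anchored pattern can only match if the two strings are otherwise identical, so any extra characters around the name (for example surrounding brackets) will make it fail.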

You can also rewrite your own code a lot more succinctly:

for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)

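Based on your edit, the species names are strings wrapped in square brackets (it looks as though lists were somehow turned into strings), so they will never equal the keys. One option is to strip the []: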

all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])

specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])

specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)

print(all_keys.intersection(specie_name_and_initial_values))

Which outputs:

set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
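If the end goal is to scale each data column by its matching initial value, a dictionary lookup may be simpler than an intersection. The following is only a sketch: it assumes specie_name_and_initial_values is a list of [name, value] pairs as described in the question, the numbers are made up, and a plain dict stands in for the pandas table. The names initial_values and scaled are hypothetical.

```python
# Hypothetical sample data; the real values would come from the COPASI file.
specie_name_and_initial_values = [
    ['[Cyp26_G_R1]', 2.0],
    ['[Cyp26_G_rep1]', 0.5],
    ['[Cyp26_G_R1R2]', 10.0],
]

# Plain dict standing in for the pandas table (column name -> values).
data = {
    'Time': [0, 60, 120],
    'Cyp26_G_R1': [1.0, 1.5, 2.0],
    'Cyp26_G_rep1': [1.0, 0.8, 0.4],
}

# Map bare species names (brackets stripped) to their initial values.
initial_values = {name.strip('[]'): value
                  for name, value in specie_name_and_initial_values}

# Scale every non-Time column whose name matches a species exactly.
scaled = {}
for key, column in data.items():
    if key != 'Time' and key in initial_values:
        scaled[key] = [x * initial_values[key] for x in column]

print(scaled)
```

Because the lookup is an exact dictionary key match, 'Cyp26_G_R1' does not pick up the value for 'Cyp26_G_R1R2'. With a real DataFrame the loop body would presumably become data[key] = data[key] * initial_values[key], in the same way the question already rescales data['Time'].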

FYI, if you had lists inside the set, you would have gotten a TypeError, because lists are mutable and therefore not hashable.
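A quick way to see this for yourself, using made-up [name, value] pairs (error_message and pairs are illustrative names only):

```python
# Building a set of lists fails because lists cannot be hashed.
try:
    set([['xyz', 1.0], ['xyz2', 2.0]])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # unhashable type: 'list'

# Tuples are immutable and hashable, so converting each pair works.
pairs = set(tuple(pair) for pair in [['xyz', 1.0], ['xyz2', 2.0]])
print(('xyz', 1.0) in pairs)  # True
```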

Upvotes: 2
