Regular expressions matching words which contain the pattern but also the pattern plus something else

Question

I have the following problem:

list1=['xyz','xyz2','other_randoms']
list2=['xyz']

I need to find which elements of list2 are in list1. In practice, each element of list1 corresponds to a numerical value that I need to retrieve and then change. The problem is that 'xyz2' contains 'xyz', so a regular expression built from 'xyz' matches it as well.

My code so far (where 'data' is a Python dictionary and 'specie_name_and_initial_values' is a list of lists, each sublist holding two elements: the species name and the numerical value that goes with it):

    all_keys = list(data.keys())
    for i in range(len(all_keys)):
        if all_keys[i]!='Time':
            #print all_keys[i]
            pattern = re.compile(all_keys[i])
            for j in range(len(specie_name_and_initial_values)):
                print re.findall(pattern,specie_name_and_initial_values[j][0])

Variations of the regular expression I have tried include:

pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')

I've also tried using 'in' as the test inside a for loop.

Any help would be greatly appreciated. Thanks

Ciaran

----------EDIT------------

To clarify, my current code is below. It is used as a method inside a class.

def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
    copasi_tool = MineParamEstTools() 
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0) 
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]

The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up each header name in the specie_name_and_initial_values variable and obtain the corresponding value (the second element of each sublist in specie_name_and_initial_values). After this, I multiply all values in the matching column of my data table by the value obtained for that species.

I'm most likely overcomplicating this. Do you have a better solution?

thanks

----------edit 2 ---------------

Okay, below are my variables

all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])

species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])

Padraic Cunningham · Accepted Answer

You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:

list1=['xyz','xyz2','other_randoms']
list2=['xyz']

print(set(list2).intersection(list1))
# prints set(['xyz']) on Python 2, {'xyz'} on Python 3

Also, if you wanted to compare 'xyz' to 'xyz2' directly, you would use == rather than in. On two strings, in is a substring test, so 'xyz' in 'xyz2' is True, whereas 'xyz' == 'xyz2' correctly returns False.
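As a rough sketch of that difference, the snippet below contrasts the substring test, plain equality, and an anchored regex on the lists from the question. The helper name exact_matches is made up for illustration and is not part of the question's code; re.escape is used on the assumption that names may contain regex metacharacters such as [ and ].

```python
import re

list1 = ['xyz', 'xyz2', 'other_randoms']

# `in` between two strings is a substring test, so 'xyz2' is picked up too
substring_hits = [s for s in list1 if 'xyz' in s]

# == compares the whole string, so only the exact name survives
equal_hits = [s for s in list1 if s == 'xyz']


def exact_matches(name, candidates):
    """Hypothetical helper: whole-string regex match with the name escaped."""
    pattern = re.compile('^' + re.escape(name) + '$')
    return [c for c in candidates if pattern.match(c)]


regex_hits = exact_matches('xyz', list1)

print(substring_hits)  # ['xyz', 'xyz2']
print(equal_hits)      # ['xyz']
print(regex_hits)      # ['xyz']
```

If all you need is exact equality, == or a set operation is simpler than the regex; the anchored pattern is only shown because the question started from re.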

You can also rewrite your own code a lot more succinctly:

for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)

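Based on your edit, your species names are strings wrapped in square brackets (it looks as if lists were turned into strings somewhere), so they never equal the bare keys from your table. One option is to strip the []: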

all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])

specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])

specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)

print(all_keys.intersection(specie_name_and_initial_values))

Which outputs:

set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
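Since the question ultimately needs the value that goes with each matched name, one possible follow-up is to build a dictionary from the name/value pairs and look the headers up in it. This is only a sketch: the numerical values below are invented for illustration, and the real pairs would come from your copasi_tool.get_copasi_initial_values call.

```python
# Made-up sample pairs; the real ones come from the COPASI file
specie_name_and_initial_values = [
    ['[Cyp26_G_R1]', 2.0],
    ['[Cyp26_G_rep1]', 0.5],
    ['[Cyp26_G_R1R2]', 10.0],
]
all_keys = ['Cyp26_G_R1', 'Cyp26_G_rep1', 'Time']

# Map the bare species name (brackets stripped) to its initial value
initial_values = dict(
    (name.strip('[]'), value) for name, value in specie_name_and_initial_values
)

# `key in initial_values` is an exact dictionary-key test, not a substring
# test, so 'Cyp26_G_R1' does not pick up 'Cyp26_G_R1R2'
matched = dict(
    (key, initial_values[key]) for key in all_keys if key in initial_values
)

print(sorted(matched.items()))  # [('Cyp26_G_R1', 2.0), ('Cyp26_G_rep1', 0.5)]
```

With the pandas table from the question, each matched column could then be scaled in the same way the question already scales the Time column, for example data[key] = data[key] * value for each key, value pair in matched.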

FYI, if you had lists inside the set you would have gotten an error, because lists are mutable and therefore not hashable.
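A quick way to see this for yourself (the sample pair is invented for illustration):

```python
# Putting a list inside a set raises TypeError because lists are unhashable
try:
    set([['xyz', 1.0]])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # unhashable type: 'list'

# Tuples are immutable and hashable, so name/value pairs can live in a set
pairs = set([('xyz', 1.0)])
print(('xyz', 1.0) in pairs)  # True
```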
