Reputation: 135
I am trying to build a dictionary with tuples of names as keys and floats as values, of the form {(nameA, nameB): datavalue, (nameB, nameC): datavalue, ...}.
The values come from a matrix that I have loaded into a pandas DataFrame with the names as both the index and column labels. I have created an ordered list of the keys for my final dictionary, called keys, with the function createDictionaryKeys(). The issue is that not all the names from this list appear in my data matrix. I want to include only the names that do appear in the data matrix in my final dictionary.
How can I do this search while avoiding the slow linear for loop? I also created a dictionary that has each name as a key and a value of 1 if it should be included and 0 otherwise. It has the form {nameA: 1, nameB: 0, ...} and is called allow_dict. I was hoping to use this to do some sort of hash search.
    def createDictionary(keynamefile, seperator, datamatrix, matrixsep):
        import pandas as pd
        keys = createDictionaryKeys(keynamefile, seperator)
        final_dict = {}
        data_df = pd.read_csv(open(datamatrix), sep=matrixsep)
        pd.set_option("display.max_rows", len(data_df))
        df_indices = list(data_df.index.values)
        df_cols = list(data_df.columns.values)[1:]
        # Relabel the integer row index with the matching column names
        for i in df_indices:
            data_df = data_df.rename(index={i: df_cols[i]})
        data_df = data_df.drop("Unnamed: 0", axis=1)
        allow_dict = descriminatePromoters(HARDCODEDFILENAME, SEP, THRESHOLD)
        #print ( item for item in df_cols if allow_dict[item] == 0 ).next()
        # Keep only the keys whose two names both appear in the matrix
        present = [x for x in keys if x[0] in df_cols and x[1] in df_cols]
        for i in present:
            final_dict[i] = data_df.loc[i[0], i[1]]
        return final_dict
Upvotes: 0
Views: 81
Reputation: 2463
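Testing membership in a Python set is O(1) on average, whereas in on a list is a linear scan. Build the set once, outside the comprehension (calling set(df_cols) inside the condition would rebuild it for every key), and test against it:
    colset = set(df_cols)
    present = [x for x in keys if x[0] in colset and x[1] in colset]
...should give you some speed-up. Since you're iterating through keys in O(n) anyway (and have to in order to construct your final_dict), you can skip the intermediate list and build the dictionary directly:
    final_dict = {k: data_df.loc[k[0], k[1]]
                  for k in keys
                  if k[0] in colset and k[1] in colset}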
Would be nice, I would think.
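If you also want to honour allow_dict, the same idea extends naturally. Below is a minimal, self-contained sketch on made-up data. The function name build_pair_dict, the toy matrix, and the assumption that allow_dict maps each name to 1 (include) or 0 (exclude) are mine for illustration, not taken from your actual files, so adjust as needed:

```python
import pandas as pd


def build_pair_dict(keys, data_df, allow_dict=None):
    """Map each (row_name, col_name) key to its matrix value.

    Hypothetical helper: keys whose names are missing from the matrix,
    or flagged 0 in allow_dict, are skipped.
    """
    # Build the lookup set once; each membership test is then O(1) on average.
    allowed = set(data_df.index) & set(data_df.columns)
    if allow_dict is not None:
        # Assumed convention from the question: 1 = include, 0 = exclude.
        allowed = {name for name in allowed if allow_dict.get(name, 0) == 1}
    return {
        (a, b): float(data_df.at[a, b])  # .at is a fast scalar lookup
        for a, b in keys
        if a in allowed and b in allowed
    }


# Toy data standing in for the real matrix (values are invented).
names = ["nameA", "nameB", "nameC"]
data_df = pd.DataFrame(
    [[0.0, 0.1, 0.2],
     [0.1, 0.0, 0.3],
     [0.2, 0.3, 0.0]],
    index=names,
    columns=names,
)

# "nameD" does not appear in the matrix, so its key is dropped.
keys = [("nameA", "nameB"), ("nameB", "nameC"), ("nameA", "nameD")]
allow_dict = {"nameA": 1, "nameB": 1, "nameC": 0, "nameD": 1}

final_dict = build_pair_dict(keys, data_df, allow_dict)
print(final_dict)
```

Here only ("nameA", "nameB") survives: "nameC" is switched off in allow_dict and "nameD" is not in the matrix at all. Passing allow_dict=None falls back to filtering on presence in the matrix only.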
Upvotes: 1