Reputation: 538
I am writing a long piece of code that is taking far too long to execute. I ran cProfile on the code and found that the following function is called 150 times and takes 1.3 seconds per call, which comes to roughly 200 seconds for this function alone. The function is -
def makeGsList(sentences, org):
    gs_list1 = []
    gs_list2 = []
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org == 'm':
                gs_list1 = [k for k in m_words if k in s]
            if org == 'h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))
    return gs_list3
The code is supposed to take a list of sentences and a flag org. It goes through each sentence, checks whether it starts with any of the words present in the list StartWords, and then lower-cases it. Then, depending on the value of org, it builds a list of all the words in the current sentence that are also present in either m_words or h_words. It keeps appending these words to another list, gs_list2. Finally, it makes a set out of gs_list2 and returns it.
Can someone give me a suggestion on how I can optimize this function to reduce its execution time?
NOTE - The words in h_words/m_words are not all single words; many of them are phrases containing 3-4 words.
Some examples -
StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1']
sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n', u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n']
h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5']
m_words are similar.
Regarding sizes - the length of both lists h_words and m_words is around 250,000, and each element is on average 2 words long. The list of sentences is around 10-20 sentences long, and I have provided an example list above to give you an idea of how big each sentence can be.
Upvotes: 1
Views: 111
Reputation: 19677
A few suggestions:
- Precompile the regular expressions for m_words and h_words.
- Move the if statements outside of the for loop.
- Compute tuple(StartWords) once and for all.
- Don't append() each element; use a set from the start instead of a list.
- Let the compiled regex replace the inner for loop.
Something like this:
loop.m_reg = re.compile("|".join(re.escape(w) for w in m_words))
h_reg = re.compile("|".join(re.escape(w) for w in h_words))
def make_gs_list(sentences, start_words, m_reg, h_reg, org):
if org == 'm':
reg = m_reg
elif org == 'h':
reg = h_reg
matched = {w for s in sentences if s.startswith(start_words)
for w in reg.findall(s.lower())}
return matched
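A quick sketch of how you might call it (assuming the StartWords, m_words, h_words and sentences from the question are in scope):

start_words = tuple(StartWords)  # build the tuple once, not on every call
matched = make_gs_list(sentences, start_words, m_reg, h_reg, 'h')
# with the question's sample data this should yield {'actin bundling'},
# since that is the only all-lowercase h_words phrase present in the sentences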
Upvotes: 2
Reputation: 2220
I would try this:
import re

# optionally change these regexes
FIRST_WORD_RE = re.compile(r"^!\w+")   # matches leading tags like !Series_title
LOWER_WORD_RE = re.compile(r"[a-z]+")  # tokenizes into single lowercase words

m_or_h_words = {'m': set(m_words), 'h': set(h_words)}
startwords_set = set(StartWords)

def makeGsList(sentences, org):
    words = m_or_h_words[org]
    gs_set2 = set()
    for s in sentences:
        mo = FIRST_WORD_RE.match(s)
        if mo and mo.group(0) in startwords_set:
            gs_set2 |= set(LOWER_WORD_RE.findall(s.lower())) & words
    return list(gs_set2)
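Note that LOWER_WORD_RE splits each sentence into individual words, so only the single-word entries of m_words/h_words can match this way. An illustrative call, assuming the question's sample data:

result = makeGsList(sentences, 'h')
# multi-word phrases such as 'actin bundling' will not be found by this version,
# because each findall() token is compared against the word set one word at a time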
Upvotes: 1
Reputation: 2938
In Python, searching a set is much faster than searching a list, so you can always convert your list to a set and then search for the word(s) in the set instead of the list. Here is my sample code (wrapped in a function so it runs as written):
from nltk.corpus import stopwords

def clean_reviews(raw_review, num_reviews):
    ### convert the stopwords from list to a set once, up front:
    ### set membership tests are O(1) instead of O(n)
    stops = set(stopwords.words("english"))
    BS_reviews = []
    for i in range(0, num_reviews):
        text = raw_review["review"][i].lower()  # Convert to lower case
        words = text.split()                    # Split into words
        # Remove stop words from "words"
        meaningful_words = [w for w in words if w not in stops]
        # Join the words back into one string
        BS_reviews.append(" ".join(meaningful_words))
    return BS_reviews
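To see the difference the set makes, here is a quick micro-benchmark sketch (the size mirrors the question's 250,000-element lists; exact numbers will vary by machine):

import timeit

words_list = [str(i) for i in range(250000)]
words_set = set(words_list)

# list membership scans the elements one by one
print(timeit.timeit("'249999' in words_list", globals=globals(), number=100))
# set membership is a single hash lookup
print(timeit.timeit("'249999' in words_set", globals=globals(), number=100))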
Upvotes: 0
Reputation: 776
I think you can take your first crack at this by tokenizing your sentences, so you'd do:

sentences = tuple(s.split(' ') for s in sentences)

Then, instead of using startswith, take your StartWords and put them in a set:

sw_set = {w for w in StartWords}

Then, as you iterate over your sentences, do:

if s[0] in sw_set:
    # proceed to the rest of your logic

I think that's where you're taking the biggest performance hit.
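Putting those steps together, a minimal sketch (note the sample sentences are tab-separated, so a plain split(), which splits on any whitespace, is assumed here instead of split(' ')):

sw_set = set(StartWords)

for s in sentences:
    tokens = s.split()  # splits on spaces and tabs alike
    if tokens and tokens[0] in sw_set:
        # set lookup on the first token replaces startswith(tuple(StartWords))
        pass  # proceed to the rest of the matching logic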
Upvotes: 0