Reputation: 538
I am writing a long piece of code that is taking far too long to execute. I ran cProfile on the code and found that the following function is called 150 times and takes 1.3 seconds per call, which comes to roughly 200 seconds for this function alone. The function is -
def makeGsList(sentences, org):
    gs_list1 = []
    gs_list2 = []
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org == 'm':
                gs_list1 = [k for k in m_words if k in s]
            if org == 'h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))
    return gs_list3
The code is supposed to take a list of sentences and a flag org. It goes through each sentence, checks whether it starts with any of the words present in the list StartWords, and then lower-cases it. Then, depending on the value of org, it builds a list of all the words in the current sentence that are also present in either m_words or h_words. It keeps appending these words to another list, gs_list2. Finally, it makes a set out of gs_list2 and returns it.
Can someone give me a suggestion on how I can optimize this function to reduce its execution time?
NOTE - The words in h_words/m_words are not all single words; many of them are phrases containing 3-4 words.
Some examples -
StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1']
sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n', u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n']
h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5']
m_words are similar.
Regarding sizes - the length of both lists h_words and m_words is around 250,000, and each element is on average 2 words long. The list of sentences is around 10-20 sentences long, and I have provided an example list above to give you an idea of how big each sentence can be.
Upvotes: 1
Views: 111
Reputation: 19677
A few suggestions:
- Precompile the regular expressions for m_words and h_words.
- Move the if statements outside of the for loop.
- Compute tuple(StartWords) once and for all.
- Don't append() each element; use a set from the start instead of a list.
- Let the compiled regex replace the inner for loop.
Something like this:
loop.m_reg = re.compile("|".join(re.escape(w) for w in m_words))
h_reg = re.compile("|".join(re.escape(w) for w in h_words))
def make_gs_list(sentences, start_words, m_reg, h_reg, org):
if org == 'm':
reg = m_reg
elif org == 'h':
reg = h_reg
matched = {w for s in sentences if s.startswith(start_words)
for w in reg.findall(s.lower())}
return matched
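A quick sketch of how you might call it (assuming the StartWords, m_words, h_words and sentences from the question are in scope):

start_words = tuple(StartWords)  # build the tuple once, not on every call
matched = make_gs_list(sentences, start_words, m_reg, h_reg, 'h')
# with the question's sample data this should yield {'actin bundling'},
# since that is the only all-lowercase h_words phrase present in the sentences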
Upvotes: 2
Reputation: 2220
I would try this:
import re

# optionally change these regexes
FIRST_WORD_RE = re.compile(r"^!\w+")   # matches leading tags like !Series_title
LOWER_WORD_RE = re.compile(r"[a-z]+")  # tokenizes into single lowercase words

m_or_h_words = {'m': set(m_words), 'h': set(h_words)}
startwords_set = set(StartWords)

def makeGsList(sentences, org):
    words = m_or_h_words[org]
    gs_set2 = set()
    for s in sentences:
        mo = FIRST_WORD_RE.match(s)
        if mo and mo.group(0) in startwords_set:
            gs_set2 |= set(LOWER_WORD_RE.findall(s.lower())) & words
    return list(gs_set2)
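Note that LOWER_WORD_RE splits each sentence into individual words, so only the single-word entries of m_words/h_words can match this way. An illustrative call, assuming the question's sample data:

result = makeGsList(sentences, 'h')
# multi-word phrases such as 'actin bundling' will not be found by this version,
# because each findall() token is compared against the word set one word at a time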
Upvotes: 1
Reputation: 2938
In Python, searching a set is much faster than searching a list, so you can always convert your list to a set and then search for the word(s) in the set instead of the list. Here is my sample code (wrapped in a function so it runs as written):
from nltk.corpus import stopwords

def clean_reviews(raw_review, num_reviews):
    ### convert the stopwords from list to a set once, up front:
    ### set membership tests are O(1) instead of O(n)
    stops = set(stopwords.words("english"))
    BS_reviews = []
    for i in range(0, num_reviews):
        text = raw_review["review"][i].lower()  # Convert to lower case
        words = text.split()                    # Split into words
        # Remove stop words from "words"
        meaningful_words = [w for w in words if w not in stops]
        # Join the words back into one string
        BS_reviews.append(" ".join(meaningful_words))
    return BS_reviews
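To see the difference the set makes, here is a quick micro-benchmark sketch (the size mirrors the question's 250,000-element lists; exact numbers will vary by machine):

import timeit

words_list = [str(i) for i in range(250000)]
words_set = set(words_list)

# list membership scans the elements one by one
print(timeit.timeit("'249999' in words_list", globals=globals(), number=100))
# set membership is a single hash lookup
print(timeit.timeit("'249999' in words_set", globals=globals(), number=100))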
Upvotes: 0
Reputation: 776
I think you can take your first crack at this by tokenizing your sentences, so you'd do:

sentences = tuple(s.split(' ') for s in sentences)

Then, instead of using startswith, take your StartWords and put them in a set:

sw_set = {w for w in StartWords}

Then, as you iterate over your sentences, do:

if s[0] in sw_set:
    # proceed to the rest of your logic

I think that's where you're taking the biggest performance hit.
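Putting those steps together, a minimal sketch (note the sample sentences are tab-separated, so a plain split(), which splits on any whitespace, is assumed here instead of split(' ')):

sw_set = set(StartWords)

for s in sentences:
    tokens = s.split()  # splits on spaces and tabs alike
    if tokens and tokens[0] in sw_set:
        # set lookup on the first token replaces startswith(tuple(StartWords))
        pass  # proceed to the rest of the matching logic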
Upvotes: 0