Python regex for removing scraping results according to substrings?

Question

I have written a scraper in Python. I have a group of strings that I search for on the page, and from those results I want to remove any that contain words from a second group of strings.

Here is the code:

def find_jobs(self, company, soup):
        allowed = re.compile(r"Developer|Engineer|Designer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                             r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                             r"Producer|Evangelist|Ninja", re.IGNORECASE)
        not_allowed = re.compile(r"^responsibilities$|^description$|^requirements$|^experience$|^empowering$|^engineering$|^"
                                 r"find$|^skills$|^recruiterbox$|^google$|^communicating$|^associated$|^internship$|^you$|^"
                                 r"proficient$|^leadsquared$|^referral$|^should$|^must$|^become$|^global$|^degree$|^good$|^"
                                 r"capabilities$|^leadership$|^services$|^expertise$|^architecture$|^hire$|^follow$|^jobs$|^"
                                 r"procedures$|^conduct$|^perk$|^missed$|^generation$|^search$|^tools$|^worldwide$|^contact$|^"
                                 r"question$|^intern$|^classes$|^trust$|^ability$|^businesses$|^join$|^industry$|^response$|^"
                                 r"using$|^work$|^based$|^grow$|^provide$|^understand$|^header$|^headline$|^masthead$|^office$", re.IGNORECASE)

        profile_list = set()
        k = soup.body.findAll(text=allowed)
        for i in k:
            if len(i) < 60 and not_allowed.search(i) is None:
                profile_list.add(i.strip().upper())
        self.update_jobs(company, profile_list)

So I am facing a problem here. With the ^ and $ anchors in not_allowed, strings such as //HEADLINE-BG and ABILITY TO LEAD & MENTOR A TEAM get through, even though headline and ability are in not_allowed. They are filtered out if I remove the anchors, but then a string such as SCALABILITY ENGINEER is not saved, because it contains the substring ability. Being new to regex, I am not sure how to get this to work. Earlier I was using this:

def find_jobs(self, company, soup):
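        # Note the "|" after Head: without it, the adjacent string literals
        # concatenate into the single alternative "HeadProducer".
        allowed = re.compile(r"Developer|Designer|Engineer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                             r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                             r"Producer|Evangelist|Ninja", re.IGNORECASE)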
        not_allowed = ['responsibilities', 'description', 'requirements', 'experience', 'empowering', 'engineering',
                       'find', 'skills', 'recruiterbox', 'google', 'communicating', 'associated', 'internship',
                       'proficient', 'leadsquared', 'referral', 'should', 'must', 'become', 'global', 'degree', 'good',
                       'capabilities', 'leadership', 'services', 'expertise', 'architecture', 'hire', 'follow',
                       'procedures', 'conduct', 'perk', 'missed', 'generation', 'search', 'tools', 'worldwide', 'contact',
                       'question', 'intern', 'classes', 'trust', 'ability', 'businesses', 'join', 'industry', 'response',
                       'you', 'using', 'work', 'based', 'grow', 'provide']

        profile_list = set()
        k = soup.body.findAll(text=allowed)
        for i in k:
            if len(i) < 60 and not any(x in i.lower() for x in not_allowed):
                profile_list.add(i.strip().upper())
        self.update_jobs(company, profile_list)

But this also dropped any string that merely contained one of the not_allowed words as a substring (for example, SCALABILITY ENGINEER because of ability). Can anyone please help with this?

swstephe · Accepted Answer

The regex

^ability$

means "the entire string consists only of the word ability" (or the entire line, in multiline mode). If you want to match substrings, just change it to

ability
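
To make the difference concrete, here is a minimal sketch comparing the two patterns on strings taken from the question. The variable names are only illustrative.

```python
import re

# Anchored: matches only when the whole string is the word itself.
anchored = re.compile(r"^ability$", re.IGNORECASE)
# Unanchored: matches anywhere, including inside longer words.
unanchored = re.compile(r"ability", re.IGNORECASE)

samples = ["ability", "ABILITY TO LEAD & MENTOR A TEAM", "SCALABILITY ENGINEER"]

anchored_hits = [s for s in samples if anchored.search(s)]
unanchored_hits = [s for s in samples if unanchored.search(s)]

print(anchored_hits)    # only the bare word
print(unanchored_hits)  # all three, including SCALABILITY ENGINEER
```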

If you want to reject the word "ability" but not words that merely contain it, such as "disability" or "scalability", then use word boundaries, something like

\bability\b
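
Applied to the whole list, one way to do this is to build a single pattern with \b on both sides of the alternation. This is only a sketch: BLOCKED_WORDS is a hypothetical, shortened stand-in for the question's full list, and is_blocked is an illustrative helper, not part of the original code.

```python
import re

# Hypothetical, shortened word list; the question's full list would go here.
BLOCKED_WORDS = ["headline", "ability", "engineering", "intern"]

# \b(?:a|b|c)\b matches any listed word only as a whole word.
# re.escape guards against regex metacharacters in the words.
not_allowed = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BLOCKED_WORDS)) + r")\b",
    re.IGNORECASE,
)

def is_blocked(text):
    """Return True if text contains any blocked word as a whole word."""
    return not_allowed.search(text) is not None

candidates = [
    "//HEADLINE-BG",
    "ABILITY TO LEAD & MENTOR A TEAM",
    "SCALABILITY ENGINEER",
]
kept = [c for c in candidates if not is_blocked(c)]
print(kept)  # ['SCALABILITY ENGINEER']
```

In find_jobs, the existing check `not_allowed.search(i) is None` would then work unchanged with a pattern built this way. Note that \b treats letters, digits and underscore as word characters, so a word such as internship is not rejected by intern.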
