Reputation: 1374
I have written a scraper in Python. I have a group of strings that I want to search for on the page, and from those results I want to remove any that contain words from a second group of strings.
Here is the code:
def find_jobs(self, company, soup):
    allowed = re.compile(r"Developer|Engineer|Designer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                         r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                         r"Producer|Evangelist|Ninja", re.IGNORECASE)
    not_allowed = re.compile(r"^responsibilities$|^description$|^requirements$|^experience$|^empowering$|^engineering$|^"
                             r"find$|^skills$|^recruiterbox$|^google$|^communicating$|^associated$|^internship$|^you$|^"
                             r"proficient$|^leadsquared$|^referral$|^should$|^must$|^become$|^global$|^degree$|^good$|^"
                             r"capabilities$|^leadership$|^services$|^expertise$|^architecture$|^hire$|^follow$|^jobs$|^"
                             r"procedures$|^conduct$|^perk$|^missed$|^generation$|^search$|^tools$|^worldwide$|^contact$|^"
                             r"question$|^intern$|^classes$|^trust$|^ability$|^businesses$|^join$|^industry$|^response$|^"
                             r"using$|^work$|^based$|^grow$|^provide$|^understand$|^header$|^headline$|^masthead$|^office$", re.IGNORECASE)
    profile_list = set()
    k = soup.body.findAll(text=allowed)
    for i in k:
        if len(i) < 60 and not_allowed.search(i) is None:
            profile_list.add(i.strip().upper())
    self.update_jobs(company, profile_list)
So I am facing a problem here. With the anchors (^ and $) in not_allowed, strings such as //HEADLINE-BG and ABILITY TO LEAD & MENTOR A TEAM are getting through, although I have the strings headline and ability in not_allowed. These are removed if I remove the anchors, but then a string such as SCALABILITY ENGINEER does not get saved, because it contains the string ability from not_allowed. Being a newbie in regex, I am not sure how to get this to work. Earlier I was using this:
def find_jobs(self, company, soup):
    allowed = re.compile(r"Developer|Designer|Engineer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                         r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                         r"Producer|Evangelist|Ninja", re.IGNORECASE)
    not_allowed = ['responsibilities', 'description', 'requirements', 'experience', 'empowering', 'engineering',
                   'find', 'skills', 'recruiterbox', 'google', 'communicating', 'associated', 'internship',
                   'proficient', 'leadsquared', 'referral', 'should', 'must', 'become', 'global', 'degree', 'good',
                   'capabilities', 'leadership', 'services', 'expertise', 'architecture', 'hire', 'follow',
                   'procedures', 'conduct', 'perk', 'missed', 'generation', 'search', 'tools', 'worldwide', 'contact',
                   'question', 'intern', 'classes', 'trust', 'ability', 'businesses', 'join', 'industry', 'response',
                   'you', 'using', 'work', 'based', 'grow', 'provide']
    profile_list = set()
    k = soup.body.findAll(text=allowed)
    for i in k:
        if len(i) < 60 and not any(x in i.lower() for x in not_allowed):
            profile_list.add(i.strip().upper())
    self.update_jobs(company, profile_list)
But this also omitted a string whenever one of the not_allowed words appeared in it as a substring (for example, ability inside SCALABILITY). Can anyone help with this?
Upvotes: 0
Views: 54
Reputation: 1910
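The regex
^ability$
means "the whole string consists only of the word ability". Without re.MULTILINE, ^ and $ anchor to the start and end of the string, so it never matches ability in the middle of longer text. If you want to match it as a substring anywhere, just change it to
ability
If you want to omit the word "ability", but not "disability", then use word boundaries, something like
\bability\b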
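To make the word-boundary idea concrete, here is a minimal sketch that builds one \b-wrapped alternation from a word list and runs it on the three strings from the question. The names NOT_ALLOWED_WORDS and is_rejected are made up for illustration, and the list is only a small subset of the question's words.

```python
import re

# Hypothetical subset of the question's word list, for illustration only.
NOT_ALLOWED_WORDS = ["headline", "ability", "engineering", "leadership"]

# Wrap the whole alternation in \b...\b so each word matches only as a
# whole word, not as part of a longer word.
not_allowed = re.compile(
    r"\b(?:" + "|".join(map(re.escape, NOT_ALLOWED_WORDS)) + r")\b",
    re.IGNORECASE,
)


def is_rejected(text):
    """Return True if text contains any not-allowed word as a whole word."""
    return not_allowed.search(text) is not None


candidates = [
    "//HEADLINE-BG",
    "ABILITY TO LEAD & MENTOR A TEAM",
    "SCALABILITY ENGINEER",
]
kept = [c for c in candidates if not is_rejected(c)]
print(kept)  # ['SCALABILITY ENGINEER']
```

One caveat: \b treats letters, digits and the underscore as word characters, so HEADLINE-BG is caught but a string like HEADLINE_BG would not be, because there is no boundary between E and _.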
Upvotes: 0
Reputation: 617
It looks like you are writing your not_allowed regex incorrectly. As written, it only matches when one of those words is the entire string.
re.compile(r'^something_i_dont_like$')
is going to match something_i_dont_like only if it is the only thing in the string.
If you want to match only strings that do not contain something, you can use a negative lookahead:
re.compile(r'^((?!something_i_dont_like).)*$')
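As a rough sketch of how that lookahead behaves on the strings from the question: on its own it still rejects any string containing the word as a substring, so SCALABILITY ENGINEER is dropped. Putting \b inside the lookahead (my addition, not part of the pattern above) limits it to whole words. The variable names here are illustrative only.

```python
import re

# Plain lookahead: rejects any string containing "ability",
# even when it is buried inside a longer word.
no_substring = re.compile(r"^((?!ability).)*$", re.IGNORECASE)

# Lookahead with word boundaries: rejects only the standalone word "ability".
no_whole_word = re.compile(r"^((?!\bability\b).)*$", re.IGNORECASE)

samples = [
    "ABILITY TO LEAD & MENTOR A TEAM",
    "SCALABILITY ENGINEER",
    "SENIOR DEVELOPER",
]

# For each sample: (accepted by plain lookahead, accepted by \b lookahead)
results = {
    s: (bool(no_substring.match(s)), bool(no_whole_word.match(s)))
    for s in samples
}
for text, accepted in results.items():
    print(text, accepted)
# ABILITY TO LEAD & MENTOR A TEAM (False, False)
# SCALABILITY ENGINEER (False, True)
# SENIOR DEVELOPER (True, True)
```

For single-line strings this gives the same answer as checking that re.search(r"\bability\b", text, re.IGNORECASE) is None, which is usually easier to read. The lookahead form is mainly useful when the exclusion has to live inside one larger pattern.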
Upvotes: 1