Wizard

Reputation: 22083

Choose the parts between lookahead and lookbehind

I'd like to extract data from job postings and output structured JSON. A single job's details look like this:

In [185]: text = """Company
     ...: 
     ...: Stack Overflow
     ...: 
     ...: Job Title
     ...: 
     ...: Student
     ...: 
     ...: Job Description
     ...: 
     ...: Our client is providing the innovative technologies, ....
     ...: 
     ...: Requirements
     ...: .....
     ...: About the Company
     ...: 
     ...: At ...., we are a specialized ..
     ...: 
     ...: Contact Info
     ...: ...
     ...: """

I tried to extract the fields with named groups:

jobs_regex = re.compile(r"""
(?P<company>Company(?<=Company).*(?:=Job Title))
# the parts between "Company and Job Title
(?P<job_title>Job Title(?<=Job Title).*(?:=Job Description))
# the parts between "Job Title and Job Description
....
""",re.VERBOSE)

However, when I run it, I get an empty list:

In [188]: jobs_regex.findall(text)
Out[188]: []

How can I solve the problem using lookarounds, (?:) and (?<=)?

Upvotes: 2

Views: 56

Answers (3)

Yanis.F

Reputation: 684

I don't know whether you really need lookarounds, but here is a simple solution that does not use them (compile it with re.DOTALL so that . also matches the newlines in your text):

Company(?P<company>.*)Job Title(?P<job_title>.*)Job Description
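
For illustration, here is a minimal sketch of how this pattern might be applied to get the JSON the question asks for. The sample text is shortened from the question; the lazy quantifiers, the re.DOTALL flag, the .strip() cleanup and the variable names are my assumptions, not part of the pattern above.

```python
import json
import re

# Sample input, shortened from the question.
text = """Company

Stack Overflow

Job Title

Student

Job Description

Our client is providing the innovative technologies, ....
"""

# re.DOTALL lets "." cross the newlines between the labels;
# the lazy ".*?" stops at the first following label.
pattern = re.compile(
    r"Company(?P<company>.*?)Job Title(?P<job_title>.*?)Job Description",
    re.DOTALL,
)

match = pattern.search(text)
# groupdict() maps each named group to its text; strip the surrounding newlines.
job = {name: value.strip() for name, value in match.groupdict().items()}
print(json.dumps(job))
```

The named groups become the JSON keys directly, so adding more fields should only require extending the pattern with more labelled groups.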

Upvotes: 2

Wiktor Stribiżew

Reputation: 626845

The main point here is that your re.VERBOSE pattern treats any literal whitespace as formatting whitespace. To match a literal space in such patterns, you need to escape it, e.g. Job Description => Job\ Description, or replace with \s shorthand character class. As a side note, if you plan to add # there, also escape this char as it starts a comment in verbose regexps.
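
A small sketch of the whitespace behaviour described above, using only the standard re module. The variable names and test strings are made up for the demonstration.

```python
import re

# Unescaped space: re.VERBOSE drops it, so this pattern is effectively "JobTitle".
unescaped = re.compile(r"Job Title", re.VERBOSE)

# An escaped space or the \s shorthand survives in verbose mode.
escaped = re.compile(r"Job\ Title", re.VERBOSE)
shorthand = re.compile(r"Job\sTitle", re.VERBOSE)

# An unescaped "#" would start a comment; escaping it matches a literal "#".
hash_escaped = re.compile(r"C\#", re.VERBOSE)

print(unescaped.search("Job Title"))     # no match: the space was ignored
print(unescaped.search("JobTitle"))      # matches the run-together form
print(escaped.search("Job Title"))       # matches
print(shorthand.search("Job Title"))     # matches
print(hash_escaped.search("C# developer"))
```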

Another minor issue is that you try to match two substrings consecutively, while they do not follow each other in your input. A possible solution here is to divide the two patterns with an alternation operator, |.

Here is a fixed pattern:

jobs_regex = re.compile(r"""
    (?<=Company).*?(?=Job\ Title)
      # the part between "Company" and "Job Title"
    | # or
    (?P<job_title>Job\ Title).*?(?:Job\ Description)
      # from "Job Title" through "Job Description"
""", re.VERBOSE | re.DOTALL)  # DOTALL lets . match the newlines in the input

See the regex demo

I left the named groups and other groupings that do not harm the regex, as they seem to be part of some longer pattern. Please make sure these groupings make sense in your final regex.
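
As a rough sketch of how the alternation idea could be taken one step further to collect the fields into a dict: the variant below puts a lookbehind and a lookahead around each named group, so only the text between the labels is captured. That variant, the re.DOTALL flag, the use of finditer with lastgroup, and the shortened sample text are my assumptions for illustration.

```python
import re

# Sample input, shortened from the question.
text = """Company

Stack Overflow

Job Title

Student

Job Description

Our client is providing the innovative technologies, ....

Requirements
.....
About the Company

At ...., we are a specialized ..
"""

jobs_regex = re.compile(r"""
    (?<=Company) (?P<company>.*?) (?=Job\ Title)
      # the part between "Company" and "Job Title"
    | # or
    (?<=Job\ Title) (?P<job_title>.*?) (?=Job\ Description)
      # the part between "Job Title" and "Job Description"
""", re.VERBOSE | re.DOTALL)

job = {}
for m in jobs_regex.finditer(text):
    # lastgroup is the name of the group that took part in this match,
    # i.e. which alternative matched.
    job[m.lastgroup] = m.group(m.lastgroup).strip()

print(job)
```

Each alternative matches independently, so the fields do not have to be adjacent in the input. The later "About the Company" label produces no match here because no "Job Title" follows it.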

Upvotes: 1

Yunnosch

Reputation: 26703

With this

(?P<company>Company(?<=Company).*(?:=Job Title))

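you require "Company" to be matched explicitly in addition to the positive lookbehind, which is unnecessary, and the lookahead is broken: (?:=Job Title) is a non-capturing group that matches a literal "=Job Title", not a lookahead.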

So this will fix the problem by ONLY asking for the lookbehind to match and fixing the lookahead:

(?P<company>(?<=Company).*(?=Job Title))
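
To make the difference concrete, here is a short sketch comparing the broken grouping with the real lookahead. The shortened sample text, the re.DOTALL flag (needed because the values sit on separate lines) and the variable names are assumptions for the demonstration.

```python
import re

# Sample input, shortened from the question.
text = "Company\n\nStack Overflow\n\nJob Title\n\nStudent\n"

# (?:=Job Title) is a non-capturing group: it needs a literal "=Job Title".
broken = re.compile(r"(?P<company>(?<=Company).*(?:=Job Title))", re.DOTALL)

# (?=Job Title) is a lookahead: it asserts that "Job Title" follows
# without consuming it.
fixed = re.compile(r"(?P<company>(?<=Company).*(?=Job Title))", re.DOTALL)

print(broken.search(text))            # no match: the text has no "=Job Title"
m = fixed.search(text)
print(m.group("company").strip())     # the text between the two labels
```

Note that the greedy .* runs to the last "Job Title" in the input, so a lazy .*? may be the safer choice when the label can occur more than once.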

Upvotes: 1
