user2087008

How to match either/or with Regex

I apologise for the Regex question but..

I'm developing a regular expression for scraping job titles.

The job title will always be in the format:

Job Title: Word1 Word2 (Optional Word3)

At the moment I have this:

Job Title: ([A-Z\w]+ [A-Z\w]+)|Job Title: ([A-Z\w]+ [A-Z\w]+ [A-Z\w]+)

I'm trying to get it to match job titles with either two or three words. Each side of the pipe character works individually (the left side matches two-word job titles, the right side matches three-word job titles). However, when I combine them with the pipe character, only the left half is ever used, so I only get two-word job titles.

Does anybody have an idea what I'm doing wrong?

NB: I'm using Regexper to visualise my expression, and it looks correct there.

Cheers.

Upvotes: 4

Views: 5440

Answers (2)

Wiktor Stribiżew

Reputation: 626804

The reason is that the left alternative can match at the same position as the right one, and the pattern is not anchored. Alternatives are tried from left to right, so the two-word alternative succeeds first and the three-word alternative is never attempted. You need to anchor the pattern, swap the alternatives, or use an optional group. Here is an enhanced version:

Job Title: ([A-Z]\w* [A-Z]\w*(?: [A-Z]\w*)?)
                             ^^^^^^^^^^^^^^


If you do not care whether the initial letters are lowercase or uppercase, add the /i case-insensitive modifier, or the corresponding flag for your regex flavor (re.I, RegexOptions.IgnoreCase, etc.):

/Job Title: ([A-Z]\w* [A-Z]\w*(?: [A-Z]\w*)?)/i
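As a rough illustration in Python (an assumption, since the question does not name a language), the flag can be passed to re.compile. The sample strings below are made up for the demonstration. Note that the flag also makes the literal "Job Title: " prefix case-insensitive.

```python
import re

# Same pattern as above, compiled with and without the case-insensitive flag.
strict = re.compile(r"Job Title: ([A-Z]\w* [A-Z]\w*(?: [A-Z]\w*)?)")
relaxed = re.compile(r"Job Title: ([A-Z]\w* [A-Z]\w*(?: [A-Z]\w*)?)", re.I)

sample = "job title: senior developer"  # hypothetical input, all lowercase

# Without the flag nothing matches, because "J" and "[A-Z]" require uppercase.
strict_result = strict.search(sample)

# With re.I both the literal prefix and the [A-Z] classes ignore case.
relaxed_title = relaxed.search(sample).group(1)

print(strict_result)
print(relaxed_title)
```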

Since [A-Z\w]+ makes little sense (\w already matches A-Z), I suggest using [A-Z]\w* instead: an uppercase ASCII letter followed by zero or more alphanumeric or underscore characters.

The non-capturing group (?: [A-Z]\w*) is made optional with the ? quantifier, which means one or zero occurrences, so the third word can be missing from the input.
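Here is a small Python sketch comparing the original alternation, the swapped alternation, and the optional-group version. Python and the sample titles are my assumptions for the sake of the demo; the behaviour should be the same in most backtracking regex flavors.

```python
import re

text = "Job Title: Senior Software Engineer"  # hypothetical three-word title

# Original pattern: the two-word alternative is tried first and succeeds.
original = re.compile(
    r"Job Title: ([A-Z\w]+ [A-Z\w]+)|Job Title: ([A-Z\w]+ [A-Z\w]+ [A-Z\w]+)"
)
# Swapped alternatives: the three-word alternative is tried first.
swapped = re.compile(
    r"Job Title: ([A-Z\w]+ [A-Z\w]+ [A-Z\w]+)|Job Title: ([A-Z\w]+ [A-Z\w]+)"
)
# Optional non-capturing group for the third word.
optional = re.compile(r"Job Title: ([A-Z]\w* [A-Z]\w*(?: [A-Z]\w*)?)")

original_match = original.search(text).group(0)
swapped_match = swapped.search(text).group(0)
optional_title = optional.search(text).group(1)
two_word_title = optional.search("Job Title: Data Analyst").group(1)

print(original_match)  # stops after the second word
print(swapped_match)   # includes the third word
print(optional_title)
print(two_word_title)
```

With the optional group there is also only one capture group to read, whereas with the alternation the title ends up in group 1 or group 2 depending on which side matched.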

Upvotes: 1

Adrien Brunelat

Reputation: 4642

How about this:

Job Title: ((?: *[A-Z]\w+){2,3})


That way, if the number of words accepted changes at some point, you don't have much to change to adapt the solution.

You can add $ at the end if you don't want to match input where people enter more than three words.
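A quick Python sketch of the difference (Python and the sample titles are assumptions, not something from the question). Without $ a four-word title still matches, but only the first three words are captured; with $ it is rejected outright.

```python
import re

# The {2,3} quantifier version, with and without the end anchor.
unanchored = re.compile(r"Job Title: ((?: *[A-Z]\w+){2,3})")
anchored = re.compile(r"Job Title: ((?: *[A-Z]\w+){2,3})$")

three_words = "Job Title: Senior Software Engineer"      # hypothetical input
four_words = "Job Title: Lead Senior Software Engineer"  # hypothetical input

three_word_title = unanchored.search(three_words).group(1)
truncated_title = unanchored.search(four_words).group(1)  # first three words only
rejected = anchored.search(four_words)                    # no match at all

print(three_word_title)
print(truncated_title)
print(rejected)
```

In Python, $ matches at the end of the whole string by default, so when scanning multi-line text you would likely want re.MULTILINE so that it matches at the end of each line. Also note that [A-Z]\w+ needs at least two characters per word, so a one-letter word would not match with this pattern.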

Upvotes: 0
