Regex: Retrieve values from one of the two HTML tags.

Question

I'm using OutWit Hub to scrape company names from a website.

In some pages, the HTML tag is like this:

COMPANY NAME

while in other pages:

COMPANY NAME

All the pages use one of the above two options, but never both.

If you're not familiar with OutWit Hub, it works by asking for the marker before and the marker after the piece of information you want.

I'm trying to create a regex that will retrieve the company name regardless of which of those markers is used, for both the 'before' and the 'after' marker.

So far I have tried this for the 'before' tag, but it doesn't work:

/[]|[Name of Company: ]/

Can anyone help?

Richard Ev · Accepted Answer

Lose the square brackets ([...]). They specify a character class (a set of single characters, any one of which may match), not a sequence of characters.

/|Name of Company: /
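
To see the difference between a character class and alternation, here is a minimal Python sketch. The `<b>` marker is a hypothetical stand-in, since the actual tag is not shown above; only `Name of Company: ` comes from the question. OutWit Hub's own regex flavour may differ in details, so treat this as an illustration rather than something to paste in directly.

```python
import re

# Hypothetical 'before' markers. "<b>" is an invented stand-in for the real tag;
# "Name of Company: " is the text marker from the question.
MARKER_A = "<b>"
MARKER_B = "Name of Company: "

# Character class: matches exactly ONE character from the set N, a, m, e, space, o, f, ...
char_class = re.compile(r"[Name of Company: ]")

# Alternation: matches either complete sequence.
alternation = re.compile(re.escape(MARKER_A) + "|" + re.escape(MARKER_B))


def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(repr(first_match(char_class, "Name of Company: Acme")))   # 'N'
print(repr(first_match(alternation, "Name of Company: Acme")))  # 'Name of Company: '
print(repr(first_match(alternation, "<b>Acme</b>")))            # '<b>'
```

The character class stops after a single letter, which is why the bracketed version never behaves like a marker.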

For help understanding and debugging regular expressions, check out RegExr.

However, as others have commented, regular expressions aren't the most reliable approach to parsing HTML. For example, how do you know that there will never be any other paragraphs or spans on the page with a style of font-weight: bold?
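
A small sketch of that pitfall, assuming the marker is a paragraph styled with font-weight: bold. The page content and both names in it are invented purely for illustration.

```python
import re

# Hypothetical page fragment: markup and text are made up for this example.
PAGE = (
    '<p style="font-weight: bold">Special offer</p>'
    '<p style="font-weight: bold">Acme Ltd</p>'
)

# Capture whatever sits between the assumed 'before' and 'after' markers.
BOLD_PARAGRAPH = re.compile(r'<p style="font-weight: bold">(.*?)</p>')

matches = BOLD_PARAGRAPH.findall(PAGE)
print(matches)  # ['Special offer', 'Acme Ltd']
```

Both paragraphs match, and nothing in the pattern can tell which one is the company name.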

If you know C# then the HTML Agility Pack is a useful library for parsing HTML. It may be overkill for your needs though.
