Henry Taylor

Reputation: 17

Regex: Retrieve values from one of the two HTML tags.

I'm using OutWit Hub to scrape company names from a website.

On some pages, the HTML tag looks like this:

<p style="font-weight: bold;">COMPANY NAME</p>

while on other pages:

<span style="font-weight: bold;">COMPANY NAME</span>

All the pages use one of the above two options, but never both.

If you're not familiar with OutWit Hub, it works by asking for the marker before and the marker after the piece of information you want.

I'm trying to create a regex that will retrieve the company name regardless of which of the two tags is used, for both the 'before' marker and the 'after' marker.

So far I have tried this for the 'before' tag, but it doesn't work:

/[<p style="font-weight: bold;">]|[<p>Name of Company: <span style="font-weight: bold;">]/

Can anyone help?

Upvotes: 0

Views: 133

Answers (2)

Santosh Panda

Reputation: 7341

You can use this regular expression and take the second capture group to get the company name:

^(<p style="font-weight: bold;">|<span style="font-weight: bold;">)(.*)(</p>|</span>)
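As a rough illustration of reading the second group, here is a minimal Python sketch. It is not OutWit Hub syntax, and it uses a variant of the pattern above rather than the exact one: the ^ anchor is dropped (it would require the tag to sit at the very start of the input) and .* is made lazy so the match stops at the first closing tag. The function name company_name is hypothetical.

```python
import re

# Variant of the pattern above (an assumption, not the answer's exact regex):
# no ^ anchor, and a lazy (.*?) so matching stops at the first closing tag.
# Group 2 holds the company name.
PATTERN = re.compile(
    r'(<p style="font-weight: bold;">|<span style="font-weight: bold;">)'
    r'(.*?)'
    r'(</p>|</span>)'
)


def company_name(html):
    """Return the text captured by group 2, or None if neither tag is found."""
    match = PATTERN.search(html)
    return match.group(2) if match else None


print(company_name('<div><p style="font-weight: bold;">COMPANY NAME</p></div>'))
print(company_name(
    '<p>Name of Company: <span style="font-weight: bold;">COMPANY NAME</span></p>'
))
```

Both calls should return the text between the opening and closing tag, whichever of the two tag forms the page uses.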

Upvotes: 0

Richard Ev

Reputation: 54087

Lose the square brackets ([...]). They specify a character class (a set of characters, any one of which matches), not a sequence of characters.

/<p style="font-weight: bold;">|<p>Name of Company: <span style="font-weight: bold;">/
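To see the difference, here is a small Python sketch (not OutWit Hub syntax; the surrounding / delimiters are dropped because Python's re module does not use them). The sample page string is made up for illustration.

```python
import re

# The bracketed version is a character class: it matches ONE character
# that appears anywhere inside the brackets.
char_class = re.compile(r'[<p style="font-weight: bold;">]')

# Without brackets, each side of | is matched as a whole sequence.
alternation = re.compile(
    r'<p style="font-weight: bold;">'
    r'|<p>Name of Company: <span style="font-weight: bold;">'
)

# Hypothetical sample input.
page = '<p>Name of Company: <span style="font-weight: bold;">COMPANY NAME</span></p>'

class_match = char_class.search(page).group()
alt_match = alternation.search(page).group()

print(repr(class_match))  # a single character
print(repr(alt_match))    # the whole 'before' marker
```

The character class stops after one character, while the alternation consumes the entire marker, which is what a 'before' marker needs.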

For help understanding and debugging regular expressions, check out Regexpr.

However, as others have commented, regular expressions aren't the most reliable approach to parsing HTML. For example, how do you know that there will never be any other paragraphs or spans on the page with a style of font-weight: bold?

If you know C# then the HTML Agility Pack is a useful library for parsing HTML. It may be overkill for your needs though.
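For comparison, a parser-based approach might look like the sketch below. It uses Python's standard-library html.parser rather than the HTML Agility Pack (which is a C# library), so treat it as an analogy, not as that library's API. The class and function names are hypothetical, and it assumes the company name is the text inside a p or span whose style attribute contains font-weight: bold.

```python
from html.parser import HTMLParser


class BoldTextParser(HTMLParser):
    """Collect text inside <p> or <span> elements styled font-weight: bold."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._open_tag = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The style value can be None for a valueless attribute.
        style = dict(attrs).get("style") or ""
        normalized = style.replace(" ", "").lower()
        if (
            self._open_tag is None
            and tag in ("p", "span")
            and "font-weight:bold" in normalized
        ):
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Limitation: a nested element with the same tag name ends capture early.
        if tag == self._open_tag:
            self.names.append("".join(self._buffer).strip())
            self._open_tag = None


def extract_bold_text(html):
    """Return the text of every bold-styled <p> or <span> in the document."""
    parser = BoldTextParser()
    parser.feed(html)
    parser.close()
    return parser.names


print(extract_bold_text('<p style="font-weight: bold;">COMPANY NAME</p>'))
print(extract_bold_text(
    '<p>Name of Company: <span style="font-weight: bold;">COMPANY NAME</span></p>'
))
```

Because it returns every bold p or span, a page with more than one such element yields several candidates, which makes the ambiguity raised above visible instead of silently picking the first match.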

Upvotes: 1
