Reputation: 14008

regex needed to match anything within p tags

I need a regular expression to match anything that is within <p> tags. For example, if I had some text:

<p>Hello world</p>

The regex would match the "Hello world" part.

Upvotes: 6

Answers (6)

Tommy Cunningham

Reputation: 130

For anybody looking for a regex to match the contents of specific HTML tags, the regex below will work:

<\s*p[^>]*>(.*?)<\s*\/\s*p\s*>

This will match strings like the ones below, as mentioned in xzyfer's answer:

<p>I would like <b>all</b> the text!</p>
< p style=  "font-weight: bold;" >Hello world  <  /  p >

Link to the Regex on Regex101 here: https://regex101.com/r/kjpLII

To use the regex for other HTML tags instead of just p tags, replace both occurrences of p in the regex with the tag name you wish to match:

<\s*div[^>]*>(.*?)<\s*\/\s*div\s*>
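The tag substitution described above can be sketched in Python. This is only an illustration: the helper name tag_pattern and the sample strings are assumptions, not part of the original answer.

```python
import re

def tag_pattern(tag):
    """Build the regex above for an arbitrary tag name (hypothetical helper)."""
    name = re.escape(tag)
    return re.compile(r"<\s*" + name + r"[^>]*>(.*?)<\s*\/\s*" + name + r"\s*>")

samples = (
    '<p>I would like <b>all</b> the text!</p> '
    '< p style=  "font-weight: bold;" >Hello world  <  /  p >'
)

# strip() trims the whitespace captured before a spaced-out closing tag
p_contents = [m.strip() for m in tag_pattern("p").findall(samples)]
div_contents = tag_pattern("div").findall("<div>inside a div</div>")

print(p_contents)
print(div_contents)
```

Building the pattern from a function keeps the opening and closing tag names in sync, which is easy to get wrong when editing the regex by hand.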

Upvotes: 0

Ali

Reputation: 121

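You can do this in Python with an HTML parser (BeautifulSoup) instead of a regex:

import bs4
import requests

# link is a placeholder; set it to the URL of the page you want to parse
link = "https://example.com"
page = requests.get(link)
page_content = bs4.BeautifulSoup(page.content, 'html.parser')
# find_all returns every <p> element; call .get_text() on one to get its text
result = page_content.find_all('p')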
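If installing third-party packages is not an option, the same parser-based idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. The class ParagraphText and the function paragraph_texts are hypothetical names chosen for this example, and the input is a literal string rather than a downloaded page.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text of each <p> element (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._buffer = None    # text pieces of the open <p>, or None outside one

    def _flush(self):
        if self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # a new <p> implicitly ends an open one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

def paragraph_texts(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

texts = paragraph_texts(
    '<p>I would like <b>all</b> the text!</p>'
    '<path>not a paragraph</path>'
    '<p class="c">Hello world</p>'
)
print(texts)
```

Because the parser reports tag names exactly, <path> is never confused with <p>, and attributes or nested tags need no special handling.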

Upvotes: 2

Alexander Romanov

Reputation: 71

It seems that the solutions proposed above will fail either:

to return text within <p> ... </p> tags whenever it contains other tags like <a>, etc., or
to distinguish between <p> and <path>, or
to include tags with attributes like <p style="font-weight: bold;">

Consider using this regex:

<p(|\s+[^>]*)>(.*?)<\/p\s*>

Resulting text will be captured in group 2.

Obviously, this solution won't work properly whenever the closing tag </p> is, for some reason, enclosed in comment tags <!-- ... -->.
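A quick way to check these claims is to run the regex in Python. This is a minimal sketch; the helper p_contents and the sample markup are made up for illustration.

```python
import re

P_TAG = re.compile(r"<p(|\s+[^>]*)>(.*?)<\/p\s*>")

def p_contents(html):
    # Group 2 holds the text between the opening and closing tags
    return [m.group(2) for m in P_TAG.finditer(html)]

html = '<path d="M0 0">shape</path><p class="c">yes <a href="#">link</a></p>'
contents = p_contents(html)

# The known limitation: a commented-out </p> ends the match early
commented = p_contents("<p>a<!-- </p> -->b</p>")

print(contents)
print(commented)
```

The <path> element is skipped because the character after <p must be either > or whitespace, while the commented-out closing tag shows where a regex cannot follow the structure of the document.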

Upvotes: 7

xzyfer

Reputation: 14135

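In JavaScript:

var str = "<p>Hello world</p>";
// match() returns the capture groups; search() would only return the match index
var match = str.match(/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/);
// match[1] is "Hello world"

In PHP:

$str = "<p>Hello world</p>";
// pass $matches to receive the captured text; $matches[1] lists the contents of each <p>
preg_match_all("/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/", $str, $matches);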

These will match something as complex as this:

< p style=  "font-weight: bold;" >Hello world  <  /  p >
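For comparison, here is the same pattern tried in Python. It is a sketch for experimenting only; the variable names and the nested sample string are assumptions. Since the capture group is [^<]*, the pattern only matches paragraphs that contain no other tags.

```python
import re

NO_NESTING = re.compile(r"<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>")

plain = NO_NESTING.findall("<p>Hello world</p>")
spaced = NO_NESTING.findall('< p style=  "font-weight: bold;" >Hello world  <  /  p >')

# [^<]* cannot cross the nested <b>, so this paragraph is not matched at all
nested = NO_NESTING.findall("<p>I would like <b>all</b> the text!</p>")

print(plain, spaced, nested)
```

Replacing ([^<]*) with a lazy (.*?) lets the capture span nested tags, at the cost of the usual regex-on-HTML caveats.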

Upvotes: 11

Kimvais

Reputation: 39628

EDIT: Don't do it. Just don't.

See this question

If you insist, use <p>(.+?)</p> and the result will be in the first group. It is not perfect, but no regexp solution to the HTML parsing problem ever will be.

E.g. (in Python):

>>> import re
>>> r = re.compile('<p>(.+?)</p>')
>>> r.findall("<p>fo o</p><p>ba adr</p>")
['fo o', 'ba adr']
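One concrete imperfection worth knowing about: in Python, the dot does not match a newline by default, so a paragraph that spans several lines is skipped unless re.DOTALL is set. A small sketch, with a made-up sample string:

```python
import re

text = "<p>line one\nline two</p>"

# Without DOTALL, '.' stops at the newline and nothing matches
default_result = re.findall(r"<p>(.+?)</p>", text)

# With DOTALL, '.' also matches '\n', so the whole paragraph is captured
dotall_result = re.findall(r"<p>(.+?)</p>", text, flags=re.DOTALL)

print(default_result)
print(dotall_result)
```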

Upvotes: 7

dogbane

Reputation: 274878

Regex:

<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>

This will work for any pair of tags.

e.g. <p>hello</p>

The \1 makes sure that the opening tag matches the closing tag.

The content between the tags is captured in \2.
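A short Python sketch of the backreference in action. The helper tag_pairs and the sample strings are assumptions for illustration; note that the pattern as written only matches lowercase tag names.

```python
import re

ANY_PAIR = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>")

def tag_pairs(html):
    # group(1) is the tag name, group(2) the content between the tags
    return [(m.group(1), m.group(2)) for m in ANY_PAIR.finditer(html)]

pairs = tag_pairs('<p class="foo">hello</p><div>world</div>')

# \1 forces the closing tag to repeat the opening name, so this finds nothing
mismatched = tag_pairs("<p>hello</div>")

print(pairs)
print(mismatched)
```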

Upvotes: 1
