geoffs3310

Reputation: 14008

Regex needed to match anything within p tags

I need a regular expression that matches anything inside <p> tags. For example, given the text:

<p>Hello world</p>

the regex would match the Hello world part.

Upvotes: 6

Views: 32079

Answers (6)

Tommy Cunningham

Reputation: 130

For anybody looking for a regex to match the content of specific HTML tags, the pattern below should do the job:

<\s*p[^>]*>(.*?)<\s*\/\s*p\s*>

It matches strings like the following, including the one from xzyfer's answer:

<p>I would like <b>all</b> the text!</p>
< p style=  "font-weight: bold;" >Hello world  <  /  p >

The regex on Regex101: https://regex101.com/r/kjpLII

To match a different HTML tag, replace the p in both the opening and closing parts of the regex with that tag's name:

<\s*div[^>]*>(.*?)<\s*\/\s*div\s*>
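
The tag substitution described above can be sketched in Python. The helper name tag_pattern and the sample string are assumptions made for illustration, not part of the original answer:

```python
import re

def tag_pattern(tag):
    # Hypothetical helper: builds the pattern above for an arbitrary tag name.
    name = re.escape(tag)
    return re.compile(r"<\s*" + name + r"[^>]*>(.*?)<\s*/\s*" + name + r"\s*>")

# Sample input (made up): one tidy <div> and one with stray whitespace.
html = '<div id="a">first</div> < div >second<  /  div >'

# findall returns the content of capture group 1 for every match.
contents = tag_pattern("div").findall(html)
print(contents)
```

Because the tag name is followed directly by [^>]*, the opening part of the p version also matches tags such as <pre> or <path>, so treat this as a sketch rather than a robust matcher.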

Upvotes: 0

Ali

Reputation: 121

In Python, you can avoid regex altogether and parse the page with BeautifulSoup:

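import bs4
import requests

link = "https://example.com"  # placeholder URL (assumption): replace with the page to parse
page = requests.get(link)
page_content = bs4.BeautifulSoup(page.content, 'html.parser')
# find_all returns Tag objects; get_text() gives the text inside each <p>
result = [p.get_text() for p in page_content.find_all('p')]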

Upvotes: 2

Alexander Romanov

Reputation: 71

It seems that each of the solutions proposed above fails in at least one of these ways:

  • it does not return the text within <p>...</p> when that text contains other tags such as <a> or <em>, or
  • it does not distinguish between <p> and <path>, or
  • it does not match tags with attributes, such as <p class="content">.

Consider using this regex:

<p(|\s+[^>]*)>(.*?)<\/p\s*>

The resulting text is captured in group 2.

Note that this solution will not work properly if the closing </p> tag is, for some reason, inside a comment: <p> ... <!-- ... </p> ... -->
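
A quick way to check the three cases listed above, plus the comment caveat, is a small Python sketch. The sample HTML strings are made up for illustration:

```python
import re

# The pattern from this answer; the paragraph text lands in group 2.
P_TAG = re.compile(r"<p(|\s+[^>]*)>(.*?)<\/p\s*>")

# Sample input (made up): a <p> with attributes and nested tags,
# a <path> element that must be skipped, and a plain <p>.
html = (
    '<p class="content">Some <em>nested</em> text</p>'
    '<path d="M0 0">not a paragraph</path>'
    '<p>Plain</p>'
)
texts = [m.group(2) for m in P_TAG.finditer(html)]
print(texts)

# The caveat: a </p> inside a comment ends the match too early.
truncated = P_TAG.search('<p>a <!-- </p> --> b</p>').group(2)
print(repr(truncated))
```

The <path> element is skipped because the opening part requires either > or whitespace immediately after <p.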

Upvotes: 7

xzyfer

Reputation: 14135

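In JavaScript:

var str = "<p>Hello world</p>";
// match() returns the capture groups; search() would only return the match index
var m = str.match(/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/);
var text = m ? m[1] : null; // "Hello world"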

In PHP:

$str = "<p>Hello world</p>";
// Pass $matches to receive the results; $matches[1] holds the captured text
preg_match_all('/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/', $str, $matches);

These match even something as irregular as this:

< p style=  "font-weight: bold;" >Hello world  <  /  p >

Upvotes: 11

Kimvais

Reputation: 39628

EDIT: Don't do it. Just don't.

See this question

If you insist, use <p>(.+?)</p>; the result will be in the first group. It is not perfect, but no regex solution to the HTML parsing problem ever will be.

E.g. (in Python):

>>> import re
>>> r = re.compile('<p>(.+?)</p>')
>>> r.findall("<p>fo o</p><p>ba adr</p>")
['fo o', 'ba adr']
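
For the "don't do it" route, here is a minimal parser-based sketch that uses only Python's standard library html.parser module. The class name ParagraphText and the sample input are assumptions for illustration, and it does not try to handle malformed or nested <p> elements:

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collects the text found inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._in_p = False     # True while inside a <p>
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.paragraphs.append("".join(self._chunks))
            self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside a <p>; nested tags are dropped.
        if self._in_p:
            self._chunks.append(data)

parser = ParagraphText()
parser.feed('<p class="x">fo <b>o</b></p><path>skip</path><p>ba adr</p>')
parser.close()
print(parser.paragraphs)
```

Unlike the regex, this returns the text with inner tags stripped, and attributes or look-alike tags such as <path> need no special handling.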

Upvotes: 7

dogbane

Reputation: 274878

Regex:

<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>

This works for any pair of matching lowercase tags,

e.g. <p class="foo">hello<br/></p>

The \1 makes sure that the opening tag matches the closing tag.

The content between the tags is captured in \2.
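
To see the backreference at work, here is a small Python sketch. The sample strings are made up for illustration:

```python
import re

# Group 1 is the tag name, \1 requires the same name in the closing tag,
# and group 2 is the content between the tags.
PAIR = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>")

html = '<p class="foo">hello<br/></p><h1>Title</h1>'
pairs = [(m.group(1), m.group(2)) for m in PAIR.finditer(html)]
print(pairs)

# Limitation: with nested tags of the same name, the lazy match
# stops at the first closing tag.
nested = PAIR.search('<div><div>a</div></div>').group(2)
print(repr(nested))
```

The second case shows where the pattern breaks down: the inner </div> closes the outer match, which is the kind of nesting a real HTML parser handles and a regex does not.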

Upvotes: 1
