Reputation: 49537

Extract text between pattern using REGEX

I need help with a regex in Python.

I have a large HTML file (around 400 lines) with the following pattern:

text here(div,span,img tags)

<!-- 3GP||Link|| --> 

text here(div,span,img tags)

I am looking for a regular expression that extracts just this part:

Link

The pattern occurs only once in the HTML file.

Upvotes: 2

Answers (3)

jcollado

Reputation: 40374

In case you need to parse something else, you can also combine the regular expression with BeautifulSoup:

import re
from BeautifulSoup import BeautifulSoup, Comment

# html is assumed to hold the contents of your HTML file as a string
soup = BeautifulSoup(html)
# The comment text excludes the <!-- and --> delimiters
link_regex = re.compile(r'\s+3GP\|\|(.*)\|\|\s+')
comment = soup.find(text=lambda text: isinstance(text, Comment)
                    and link_regex.match(text))
link = link_regex.match(comment).group(1)
print link

Note that in this case the regular expression only needs to match the comment contents, because BeautifulSoup already takes care of extracting the text from the comments.
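If BeautifulSoup is not available, the same idea (let a parser find the comments, then match only the comment text) can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the code from the answer above; the names CommentLinkParser, extract_links and sample are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Matches only the comment *contents*, e.g. " 3GP||Link|| "
LINK_RE = re.compile(r'\s*3GP\|\|(.*?)\|\|\s*')


class CommentLinkParser(HTMLParser):
    """Collects the link from every comment shaped like ' 3GP||...|| '."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_comment(self, data):
        # data is the comment text without the <!-- and --> delimiters
        match = LINK_RE.fullmatch(data)
        if match:
            self.links.append(match.group(1))


def extract_links(html):
    """Return the links found in matching comments (hypothetical helper)."""
    parser = CommentLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample standing in for the real 400-line file
sample = """<div><span>text here</span></div>
<!-- 3GP||Link|| -->
<div><img src="x.png"></div>"""

print(extract_links(sample))  # ['Link']
```

Since the pattern is said to be unique in the file, the returned list should hold exactly one item.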

Upvotes: 0

MattH

Reputation: 38247

>>> d = """
... Some text here(div,span,img tags)
...
... <!-- 3GP||**Some link**|| -->
...
... Some text here(div,span,img tags)
... """
>>> import re
>>> re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>',d)
['**Some link**']

r'' is a raw string literal; it stops Python from interpreting standard string escapes.
\<!-- 3GP\|\| is a regex-escaped match for <!-- 3GP|| (the \| escapes are required; the \< is optional, since < is not a special character).
([^|]+) matches everything up to the next | and captures it in a group.
\|\| --\> is a regex-escaped match for || -->.
re.findall returns all non-overlapping matches of the pattern within a string. If the pattern contains a single group, it returns the text of that group instead of the whole match.
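To make the last point concrete, here is a small sketch of how the return value of re.findall changes with the number of groups. The text variable is a made-up string with two comments, purely for illustration; the asker's file contains only one.

```python
import re

# Hypothetical input with two matching comments
text = "<!-- 3GP||first|| --> filler <!-- 3GP||second|| -->"

# No group: findall returns the full text of each match
no_group = re.findall(r'<!-- 3GP\|\|[^|]+\|\| -->', text)

# One group: findall returns only the text captured by that group
one_group = re.findall(r'<!-- 3GP\|\|([^|]+)\|\| -->', text)

# Two groups: findall returns one tuple per match
two_groups = re.findall(r'<!-- (\w+)\|\|([^|]+)\|\| -->', text)

print(no_group)    # ['<!-- 3GP||first|| -->', '<!-- 3GP||second|| -->']
print(one_group)   # ['first', 'second']
print(two_groups)  # [('3GP', 'first'), ('3GP', 'second')]
```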

Upvotes: 4

Jan Pöschko

Reputation: 5560

import re
re.match(r"<!-- 3GP\|\|(.+?)\|\| -->", "<!-- 3GP||Link|| -->").group(1)

yields "Link".
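One caveat when applying this to a whole file: re.match only matches at the start of the string, so it returns None when the comment sits in the middle of the HTML. A minimal sketch using re.search instead, with a made-up page string standing in for the real file:

```python
import re

PATTERN = re.compile(r"<!-- 3GP\|\|(.+?)\|\| -->")

# Hypothetical stand-in for the contents of the HTML file
page = "<div>text here</div>\n<!-- 3GP||Link|| -->\n<span>text here</span>"

# match() anchors at position 0, so it finds nothing here
at_start = PATTERN.match(page)

# search() scans the whole string for the first occurrence
found = PATTERN.search(page)
link = found.group(1) if found else None

print(at_start)  # None
print(link)      # Link
```

Checking for None before calling group avoids an AttributeError if the comment is missing from the file.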

Upvotes: 0
