Reputation: 49537
I need help with a regex in Python.
I've got a large HTML file (around 400 lines) with the following pattern:
text here(div,span,img tags)
<!-- 3GP||Link|| -->
text here(div,span,img tags)
I am looking for a regular expression that extracts this:
Link
The given pattern is unique in the html file.
Upvotes: 2
Views: 4273
Reputation: 40374
In case you need to parse something else, you can also combine the regular expression with BeautifulSoup:
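import re
from bs4 import BeautifulSoup, Comment  # bs4 is the maintained successor of BeautifulSoup 3

soup = BeautifulSoup(<your html here>, "html.parser")
# Raw string so the backslashes reach the regex engine unchanged
link_regex = re.compile(r'\s+3GP\|\|(.*)\|\|\s+')
# Find the first comment node whose text matches the pattern
comment = soup.find(string=lambda text: isinstance(text, Comment)
                    and link_regex.match(text))
link = link_regex.match(comment).group(1)
print(link)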
Note that in this case the regular expression only needs to match the comment contents, because BeautifulSoup already takes care of extracting the text from the comments.
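If installing a third-party parser is not an option, the same split (let a parser isolate the comment text, then match only that text) can be sketched with the standard library's html.parser. This is only an illustration: the CommentCollector class, the extract_link function and the sample string are made-up names, not part of the answer above.

```python
import re
from html.parser import HTMLParser

# Matches only the comment *contents*, e.g. " 3GP||Link|| ".
LINK_RE = re.compile(r"\s*3GP\|\|(.*?)\|\|\s*")


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->", delimiters excluded
        self.comments.append(data)


def extract_link(html):
    """Return the first 3GP link found in a comment, or None."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    for text in parser.comments:
        found = LINK_RE.fullmatch(text)
        if found:
            return found.group(1)
    return None


sample = "<div>text</div>\n<!-- 3GP||Link|| -->\n<span>more</span>"
print(extract_link(sample))
```

Because the parser strips the comment delimiters, the pattern never has to mention <!-- or -->.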
Upvotes: 0
Reputation: 38247
>>> d = """
... Some text here(div,span,img tags)
...
... <!-- 3GP||**Some link**|| -->
...
... Some text here(div,span,img tags)
... """
>>> import re
>>> re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>',d)
['**Some link**']
r'' is a raw string literal; it stops Python from interpreting standard string escapes.
\<!-- 3GP\|\| is a regexp-escaped match for <!-- 3GP|| (the backslashes before < and > are harmless but not required).
([^|]+) matches everything up to the next | and captures it as a group for convenience.
\|\| --\> is a regexp-escaped match for || -->
re.findall returns all non-overlapping matches of the pattern within a string; if the pattern contains a group, it returns the group's text instead of the whole match.
Upvotes: 4
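As a follow-up to the re.findall answer above, here is a small sketch of how the return value changes with the number of groups in the pattern. The text string and variable names are invented for illustration.

```python
import re

text = "<!-- 3GP||first|| --> filler <!-- 3GP||second|| -->"

# No group: each result is the whole match.
whole = re.findall(r"<!-- 3GP\|\|[^|]+\|\| -->", text)

# One group: each result is just that group's text.
one_group = re.findall(r"<!-- 3GP\|\|([^|]+)\|\| -->", text)

# Two groups: each result is a tuple with one entry per group.
two_groups = re.findall(r"<!-- (\w+)\|\|([^|]+)\|\| -->", text)

print(whole)
print(one_group)
print(two_groups)
```

Since the pattern in the question is unique in the file, the one-group form gives a single-element list, and indexing it with [0] yields the link.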
Reputation: 5560
import re
re.match(r"<!-- 3GP\|\|(.+?)\|\| -->", "<!-- 3GP||Link|| -->").group(1)
yields "Link".
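One caveat when applying this to a whole file: re.match only tries to match at the very start of the string, so it works on the bare comment but not on a page where other markup comes first. A minimal sketch of the difference, using a made-up page string:

```python
import re

PATTERN = re.compile(r"<!-- 3GP\|\|(.+?)\|\| -->")

page = "<div>text</div>\n<!-- 3GP||Link|| -->\n<span>more</span>"

# match() only tries position 0, so it finds nothing here.
at_start = PATTERN.match(page)

# search() scans forward until the pattern matches somewhere.
anywhere = PATTERN.search(page)

print(at_start)
print(anywhere.group(1) if anywhere else None)
```

For the full HTML file, re.search (or re.findall) is therefore the safer choice.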
Upvotes: 0