Reputation: 86
I have a task to find a pattern in some HTML code and print it.
I am trying to extract it using regex, but I am a complete beginner at it.
Here is the HTML code: https://pastebin.com/cfvtLpZZ
And here is a part of the code I need to extract:
<span>Re: Máte zprávu od ubytování Lanterna Sunny Resort by Valamar<br> <br> Excuse me, but I have no pets.Please, I want ground floor, no stairs.Is it possible? Thank you for your answer.Hana Seidlová </span>
Inside of it, I need to get a match between Re: and /span.
This is a regex pattern I tried to use: "^Re:.*span$"
The code:
import re
HTMLcode = str(input("Enter the code you wanna scrape: "))
def scrape(HTMLcode):
    HTMLscrape = re.search("^Re:.*span$", HTMLcode)
    print(HTMLscrape.group(0))
scrape(HTMLcode)
The issue I am having is that the interpreter sees HTMLscrape as a NoneType.
I also tried this code, but I had no luck:
def scrape(HTMLcode):
    HTMLcompile = re.compile("^Re:.*span")
    HTMLsearch = HTMLcompile.search(HTMLcode)
    print(HTMLsearch.group(0))
I also tried the regex101 website, but it says that no match can be found.
What is the issue? Any type of explanation or info/feedback is appreciated!!!
Upvotes: 0
Views: 104
Reputation: 2083
You want a lookbehind and a lookahead to exclude the <span>Re: and </span>, respectively. You can use this regex for that:
(?<=<span>Re:).*(?=</span>)
See regex101 example
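A minimal sketch of how that pattern could be used from Python, assuming the HTML is already in a string. The `html` sample and the `extract_reply` name are made up for illustration. Two things differ from the regex above and are my assumptions: `.*?` is non-greedy so the match stops at the first </span>, and `re.DOTALL` lets `.` cross newlines. `re.search` returns `None` when nothing matches, so the result is checked before calling `.group()`.

```python
import re

# Hypothetical sample; the real page is the pastebin linked in the question.
html = '<div><span>Re: Hello<br> <br> Is it possible? Thank you.</span></div>'

# The lookbehind and lookahead assert the surrounding tags without
# including them in the match. .*? stops at the first </span>.
PATTERN = re.compile(r"(?<=<span>Re:).*?(?=</span>)", re.DOTALL)

def extract_reply(text):
    """Return the text between '<span>Re:' and '</span>', or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

result = extract_reply(html)
print(result)
```

Returning `None` instead of calling `.group(0)` unconditionally avoids the NoneType error described in the question.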
Upvotes: 0
Reputation: 1290
Using https://regex101.com is a good way to check whether your regex works. In this case, yours does not. ^Re: means the match must begin at the start of the string (or of a line, with re.MULTILINE) with Re:, then .* matches everything after it, and span$ requires the text to end with span. In your HTML, Re: comes right after <span> rather than at the start, so nothing matches.
Isn't re.search("<span>Re:(.*?)</span>", HTMLcode) more like what you want? It starts at <span>Re: and captures everything up to </span>.
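To make the difference concrete, here is a small sketch comparing the anchored pattern from the question with the capture-group version. The `html` string is a made-up one-line sample, not the real pastebin content, and `re.DOTALL` is an added assumption in case the span covers several lines.

```python
import re

# Hypothetical one-line sample standing in for the pastebin page.
html = '<td><span>Re: Hello<br> no pets. </span></td>'

# The question's pattern: ^ needs "Re:" at the very start of the string and
# $ needs "span" at the very end, so search() finds nothing here.
anchored = re.search(r"^Re:.*span$", html)

# Without anchors, group(1) holds the text between the two delimiters.
captured = re.search(r"<span>Re:(.*?)</span>", html, re.DOTALL)

print(anchored)  # None
if captured:
    print(captured.group(1).strip())
```

Note that `group(1)` is the captured part only, while `group(0)` would include the <span>Re: and </span> delimiters as well.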
Upvotes: 1