MatejR

Reputation: 86

Regex always returns as NoneType or doesn't find any pattern at all

I was given a task to find a pattern in some HTML code and print it.

I am trying to extract it with a regex, but I am sadly a complete beginner at it.

Here is the HTML code: https://pastebin.com/cfvtLpZZ

And here is a part of the code I need to extract:

<span>Re: Máte zprávu od ubytování Lanterna Sunny Resort by Valamar<br>  <br>  Excuse me, but I have no pets.Please, I want ground floor, no stairs.Is it possible? Thank you for your answer.Hana Seidlová  </span>

Within it, I need to match the text between Re: and </span>.

This is a regex pattern I tried to use: "^Re:.*span$"

The code:

import re

HTMLcode = str(input("Enter the code you wanna scrape: "))

def scrape(HTMLcode):
  HTMLscrape = re.search("^Re:.*span$", HTMLcode)

  print(HTMLscrape.group(0))

scrape(HTMLcode)

The issue I am having is that the interpreter treats HTMLscrape as NoneType.

I also tried this code, but I had no luck:

def scrape(HTMLcode):
  HTMLcompile = re.compile("^Re:.*span")

  HTMLsearch = HTMLcompile.search(HTMLcode)

  print(HTMLsearch.group(0))

I also tried the regex101 website, but it says that no match can be found.

What is the issue? Any type of explanation or info/feedback is appreciated!!!

Upvotes: 0

Views: 104

Answers (2)

LeoE

Reputation: 2083

You want a lookbehind and a lookahead to exclude the <span>Re: and the </span>, respectively. You can use this regex for that:

(?<=<span>Re:).*(?=</span>)

See regex101 example
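As a rough sketch of how the lookaround pattern could be used from Python: the function name extract_reply and the shortened sample string are hypothetical, and two details are additions not in the answer above, namely the lazy .*? (so the match stops at the first </span>) and re.DOTALL (so . also crosses line breaks in multi-line HTML).

```python
import re

# Shortened sample based on the snippet in the question.
html = ("<span>Re: Máte zprávu od ubytování Lanterna Sunny Resort by Valamar<br>  <br>  "
        "Excuse me, but I have no pets. Thank you for your answer.Hana Seidlová  </span>")

def extract_reply(html_code):
    # The lookbehind and lookahead keep "<span>Re:" and "</span>" out of the match.
    # Assumption: lazy ".*?" and re.DOTALL are added here for multi-span, multi-line input.
    match = re.search(r"(?<=<span>Re:).*?(?=</span>)", html_code, re.DOTALL)
    # re.search returns None when nothing matches, so check before calling group().
    return match.group(0).strip() if match else None

print(extract_reply(html))
```

Checking for None before calling group() avoids the AttributeError on NoneType described in the question.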

Upvotes: 0

Orphee Faucoz

Reputation: 1290

Using https://regex101.com is a good way to check whether your regex works. In this case it does not. ^Re: matches only if the text starts with Re:, .* then matches everything after it, and span$ requires the text to end with span. The text you want starts with <span>, not Re:, and ends with </span>, not span, so nothing matches.

Isn't

re.search("<span>Re:(.*?)</span>", HTMLcode)

closer to what you want? It starts at <span>Re: and captures everything up to </span>.
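A minimal sketch of the capture-group approach, assuming a short made-up sample string and a None check that the original code lacks. It also shows why the anchored pattern from the question finds nothing. The re.DOTALL flag is an addition for HTML that spans several lines.

```python
import re

# Hypothetical sample: the wanted text sits in the middle of the string.
html = "<div><span>Re: Hello<br>  <br>  I have no pets.  </span></div>"

# The pattern from the question: the string starts with "<div>", not "Re:",
# so "^" can never match and re.search returns None.
original = re.search(r"^Re:.*span$", html)
print(original)

def scrape(html_code):
    # Group 1 holds everything between "<span>Re:" and the first "</span>".
    match = re.search(r"<span>Re:(.*?)</span>", html_code, re.DOTALL)
    if match is None:
        return None  # no match: avoid calling group() on None
    return match.group(1).strip()

print(scrape(html))
```

Note that input() reads a single line only, so HTML pasted across several lines would be cut off at the first line break; reading the HTML from a file may be a safer way to feed it to scrape.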

Upvotes: 1
