Addoodi

Reputation: 13

re.findall with requests doesn't match copied and pasted html (generated by requests.text)

I'm trying to capture some elements from the HTML of a certain URL. When I copy and paste the HTML directly into my Python code, it works well.

import re

# Sample HTML content
html_content = """
<<<HTML Code>>>
"""

# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

# Find matches
matches = re.findall(pattern, html_content)

# Print matches
for match in matches:
    print(match)

This works well. But when I run the same regex on the HTML fetched directly with requests.get, it finds no matches:

import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text

# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

# Find matches
matches = re.findall(pattern, html_content)

# Print matches
for match in matches:
    print(match)

Note that the HTML I'm copying and pasting was itself generated by requests.get and saved to a file like this:

with open('raw_html.html', 'w', encoding='utf-8') as f:
    f.write(html_content)

Upvotes: 0

Views: 39

Answers (1)

Addoodi

Reputation: 13

I managed to solve the problem by adding:

no_bs = html_content.replace('\\"', '"')

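which removes the backslashes that escape the double quotes in the response (\" becomes "). These backslashes do not survive when the HTML is pasted manually, most likely because Python treats \" inside a regular (non-raw) triple-quoted string as an escape sequence for a plain double quote. The final code looks like this: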

import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text
no_bs = html_content.replace('\\"', '"')
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

# Find matches
matches = re.findall(pattern, no_bs)

# Print matches
for match in matches:
    print(match)
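
To illustrate why the two cases differ, here is a minimal sketch on a made-up snippet. The sample text and the example.com URLs are assumptions standing in for the real page, and the `tolerant` pattern is just one possible alternative to calling replace, not something taken from the site or from requests:

```python
import re

# Hypothetical stand-in for response.text: the JSON sits inside a script
# string, so every double quote is preceded by a literal backslash.
raw = (r'<script>data = "{\"order\":1,\"url\":\"https://example.com/p1.webp\"},'
       r'{\"order\":2,\"url\":\"https://example.com/p2.webp\"}"</script>')

# The same kind of text pasted into a regular (non-raw) triple-quoted string:
# Python turns each \" into a plain ", so the backslashes disappear.
pasted = """{\"order\":1,\"url\":\"https://example.com/p1.webp\"}"""

pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

print(re.findall(pattern, pasted))   # pasted text matches
print(re.findall(pattern, raw))      # escaped text does not

# Fix from the answer: drop the backslash in front of each double quote.
cleaned = raw.replace('\\"', '"')
print(re.findall(pattern, cleaned))

# Alternative sketch: accept an optional backslash before every quote,
# so the same pattern works on both the raw and the cleaned text.
tolerant = r'\{\\?"order\\?":\d+,\\?"url\\?":\\?"(https:[^"\\]+\.webp)\\?"\}'
print(re.findall(tolerant, raw))
```

Printing a slice of the response with repr(), for example repr(html_content[:500]), is a quick way to check whether such backslashes are present before adjusting the pattern.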

Upvotes: 1
