Reputation: 13
I'm trying to capture some elements from the HTML of a certain URL. When I copy and paste the HTML directly into my Python code, it works well.
import re
# Sample HTML content
html_content = """
<<<HTML Code>>>
"""
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'
# Find matches
matches = re.findall(pattern, html_content)
# Print matches
for match in matches:
    print(match)
That works well. But when I try to do the same by fetching the page directly with requests.get, it doesn't work:
import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'
# Find matches
matches = re.findall(pattern, html_content)
# Print matches
for match in matches:
    print(match)
Keep in mind that the HTML I'm copying and pasting was itself generated using requests.get:
with open('raw_html.html', 'w', encoding='utf-8') as f:
    f.write(html_content)
Upvotes: 0
Views: 39
Reputation: 13
I managed to solve the problem by adding:
no_bs = html_content.replace('\\"', '"')
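which removes the backslashes that escape the double quotes in the response. Those backslashes don't survive when the HTML is pasted manually into a regular (non-raw) Python string, most likely because Python reads \" inside the literal as an escaped quote and drops the backslash. The final code looks like this: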
import re
import requests
url = "https://asuracomic.net/series/bloodhounds-regression-instinct-2d0edc16/chapter/59"
response = requests.get(url)
html_content = response.text
no_bs = html_content.replace('\\"', '"')
# Regex pattern
pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'
# Find matches
matches = re.findall(pattern, no_bs)
# Print matches
for match in matches:
    print(match)
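To illustrate what I think is going on, here is a minimal sketch that needs no network access. The sample strings and the example.com URL are made up for the demonstration, and the `tolerant` pattern is just one possible alternative to the replace() call, not something I have tested against the real page.

```python
import re

# Hypothetical sample of what the server sends: quotes escaped as \".
# A raw string keeps the backslashes, the same way response.text does.
fetched = r'{\"order\":1,\"url\":\"https://example.com/a.webp\"}'

# The same text pasted into a regular (non-raw) literal: Python treats \"
# as an escaped quote, so the backslashes are gone from the resulting string.
pasted = '{\"order\":1,\"url\":\"https://example.com/a.webp\"}'

pattern = r'{"order":\d+,"url":"(https:[^"]+\.webp)"}'

print(re.findall(pattern, fetched))  # []
print(re.findall(pattern, pasted))   # ['https://example.com/a.webp']

# Alternative sketch: allow an optional backslash before each quote, so the
# same pattern handles both the escaped and the unescaped form.
tolerant = r'\{\\?"order\\?":\d+,\\?"url\\?":\\?"(https:[^"\\]+\.webp)\\?"\}'

print(re.findall(tolerant, fetched))  # ['https://example.com/a.webp']
print(re.findall(tolerant, pasted))   # ['https://example.com/a.webp']
```

The replace() approach is simpler to read. The tolerant pattern only avoids modifying the response text before matching.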
Upvotes: 1