Reputation: 89
I'm trying to scrape data from an HTML file that contains a React props div like this:
<html>
<div data-react-class="UserDetails"
data-react-props="{
"targetUser":{
"targetUserLogin":"user",
"targetUserDuration":"11 months, 27 days","""
}
}
I have no idea how to extract this data reliably, because the text varies from person to person: someone with exactly 2 years would have no months or days in the text at all. I need the years, months, and days as numbers so I can do calculations with them. I wrote this to find the part of the file that I need, but I don't know how to approach the rest.
with open("data.html", 'r') as fpIn:
    for line in fpIn:
        line = line.rstrip()  # Strip trailing spaces and newline
        if "targetUserDuration" in line:
            print("Found")
Upvotes: 0
Views: 136
Reputation: 13097
I would probably start by looking at BeautifulSoup. I think it will unescape the attribute automatically. I know it means loading more libraries, but I would use html.unescape() and json.loads(), as this seems to naturally fit the way the data is provided, rather than try to parse it myself. Hand parsing seems unnecessarily brittle here.
from html import unescape
from json import loads
text = """
{
"targetUser":{
"targetUserLogin":"user",
"targetUserDuration":"11 months, 27 days"
}
}
"""
print(loads(unescape(text))["targetUser"]["targetUserDuration"])
Gives you:
11 months, 27 days
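If the real file stores the props as entity-escaped JSON inside the attribute (with &quot; in place of the quotes, which is an assumption, since the sample in the question shows plain quotes), the lookup could be sketched with the standard library alone. The class name ReactPropsParser and the sample string below are made up for illustration:

```python
import json
from html.parser import HTMLParser


class ReactPropsParser(HTMLParser):
    """Collects the decoded data-react-props value of every tag that has one."""

    def __init__(self):
        super().__init__()
        self.props = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already replaced entities such as &quot; in attrs.
        for name, value in attrs:
            if name == "data-react-props" and value:
                self.props.append(json.loads(value))


# Hypothetical sample: assumes the attribute holds entity-escaped JSON.
sample = (
    '<html><div data-react-class="UserDetails" '
    'data-react-props="{&quot;targetUser&quot;:{'
    '&quot;targetUserLogin&quot;:&quot;user&quot;,'
    '&quot;targetUserDuration&quot;:&quot;11 months, 27 days&quot;}}">'
    '</div></html>'
)

parser = ReactPropsParser()
parser.feed(sample)
duration = parser.props[0]["targetUser"]["targetUserDuration"]
print(duration)
```

For the real file you would feed it the contents of data.html instead of the sample string. If the attribute is not valid JSON after unescaping, json.loads() will raise an error, so check the raw value first.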
Upvotes: 2
Reputation: 353
Use regular expressions to find it.
import re
html = '..."targetUserDuration":"11 months, 27 days","""...'
# \d+ accepts any whole number, so values such as 10, 15 or 20 are captured in full
years_re = re.compile(r'UserDuration".*?(\d+) year.*?"""')
months_re = re.compile(r'UserDuration".*?(\d+) month.*?"""')
days_re = re.compile(r'UserDuration".*?(\d+) day.*?"""')
year_found = years_re.search(html)
months_found = months_re.search(html)
days_found = days_re.search(html)
# Any unit that is missing from the text stays at 0
years, months, days = 0, 0, 0
if year_found:
    years = int(year_found.group(1))
if months_found:
    months = int(months_found.group(1))
if days_found:
    days = int(days_found.group(1))
print('years:', years)
print('months:', months)
print('days:', days)
Result:
years: 0
months: 11
days: 27
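Since the goal is to get numbers even when a unit is missing (for example "2 years"), one possible variation is to run a single pattern over the duration value once it has been pulled out, and let absent units default to 0. This is only a sketch: parse_duration is a hypothetical helper name, and the unit spellings are assumed from the sample in the question.

```python
import re

# One pattern for all three units; the trailing "s" is optional ("1 year").
UNIT_RE = re.compile(r"(\d+)\s+(year|month|day)s?")


def parse_duration(text):
    """Return a dict of years, months and days, with 0 for missing units."""
    parts = {"years": 0, "months": 0, "days": 0}
    for amount, unit in UNIT_RE.findall(text):
        parts[unit + "s"] = int(amount)
    return parts


print(parse_duration("11 months, 27 days"))
print(parse_duration("2 years"))
```

The first call gives 0 years, 11 months and 27 days; the second gives 2 years with months and days left at 0.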
Upvotes: 2