Reputation: 145
I am trying to extract some information from an HTML document using regex (it must be regex, not an HTML parser). The document is called "website1.html" and contains the following:
<div class="category"><div class="comedy">Category1</div></div>
<p class="desc">Title1</p>
<p class="date">Date1/p>
<div class="category"><div class="comedy">Category2</div></div>
<p class="desc">Title2</p>
<p class="date">Date2/p>
How can I first open the HTML document so that Python can read it, and then extract the text from class="comedy", class="desc", and class="date" using regex findall expressions?
I want the results in separate lists, so that I end up with ["Title1", "Title2"] in one list, ["Category1", "Category2"] in another, and so on.
I have the overall process mapped out in my head, but I don't know the specific characters/functions to use.
Upvotes: 0
Views: 77
Reputation: 2348
You can do this with regular expressions, as in the following example:
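import re

filename = 'path\\website1.html'
# Read the whole file into one string; the with block closes the file afterwards
with open(filename, "r") as f:
    t = f.read()

# Each findall returns a list of the text captured between the tags
categories = re.findall(r'<div class="comedy">(.*?)</div>', t)
descs = re.findall(r'<p class="desc">(.*?)</p>', t)
# The sample closes the date paragraphs with "/p>" rather than "</p>"
dates = re.findall(r'<p class="date">(.*?)/p>', t)

# Print the results
print(categories)
print(descs)
print(dates)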
The result:
['Category1', 'Category2']
['Title1', 'Title2']
['Date1', 'Date2']
Note that your HTML is not well formed: the date paragraphs end with /p> instead of </p> (for example, <p class="date">Date2/p>). The date pattern above is written to match your example exactly as posted.
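If the real file turns out to close the date paragraphs properly, a pattern that accepts both forms may be safer. The following is only a sketch: the helper name extract_dates and the well-formed second line of the sample string are made up for illustration and are not from the question.

```python
import re

# Hypothetical helper, not part of the original answer.
# "<?" makes the opening angle bracket optional, so the pattern accepts
# both the well-formed "</p>" and the malformed "/p>" closer.
DATE_PATTERN = re.compile(r'<p class="date">(.*?)<?/p>')

def extract_dates(text):
    """Return the text of every <p class="date"> element in text."""
    return DATE_PATTERN.findall(text)

# First line mirrors the question's malformed markup; the second line is an
# assumed well-formed variant added only to exercise the pattern.
sample = (
    '<p class="date">Date1/p>\n'
    '<p class="date">Date2</p>\n'
)
print(extract_dates(sample))  # ['Date1', 'Date2']
```

The lazy (.*?) stops at the first place the closer can match, so the optional "<" is never included in the captured text.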
Upvotes: 1