Reputation: 127
I need to scrape some information from a very challenging website.
This is an example of the markup:
<div class="overview">
<span class="course_titles">Courses:</span>
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
<a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
<a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>
Each course lists its own students, with each student's age given after the name (the stray punctuation is already in the source).
I need to scrape each course together with its students and their ages.
Unfortunately, there is no hierarchy apart from the all-encompassing div. I tried scraping with BeautifulSoup by "course_name" and then adding every item that has the "coursestudent_name" class, but that way every student on the page gets added to each course.
I wish I could change the website, but I cannot. Does anyone have an idea how I could get the information per course with the correct students?
Thank you!
Upvotes: 2
Views: 78
Reputation: 180441
You don't need a regex. You can select the student anchor tags to get each name, then read next_sibling, the text node that follows the anchor, and split and strip it to get the age. Calling find_previous for the nearest course_name anchor before each coursestudent_name anchor gives you the relevant course:
h = """<div class="overview">
<span class="course_titles">Courses:</span>
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
<a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
<a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>"""
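from bs4 import BeautifulSoup
soup = BeautifulSoup(h, "html.parser")  # name the parser explicitly
# Each row: nearest preceding course, student name, age from the text after the anchor.
data = [
    [a.find_previous("a", "course_name").text, a.text, a.next_sibling.split()[0].strip(",")]
    for a in soup.select("div.overview a.coursestudent_name")
]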
[[u'Math101', u'Mark', u'17'], [u'Math101', u'Alex', u'18'], [u'English101', u'Sarah', u'16'], [u'English101', u'Nancy', u'17'], [u'English101', u'Casey', u'17']]
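If you would rather have the students grouped per course than a flat list of rows, the same two calls (find_previous and next_sibling) can feed a dictionary. This is only a minimal sketch under the assumption that every student anchor is followed by a text node starting with the age, as in the sample; the names `courses` and `age_text` are mine, not from the page.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

html = """<div class="overview">
<span class="course_titles">Courses:</span>
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
<a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
<a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>"""

soup = BeautifulSoup(html, "html.parser")
courses = defaultdict(list)

for a in soup.select("div.overview a.coursestudent_name"):
    # The closest course anchor that appears before this student in the document.
    course = a.find_previous("a", class_="course_name").get_text()
    # Assumption: the text node right after the anchor begins with the age, e.g. " 18, ),".
    age_text = a.next_sibling.split()[0].strip(",")
    courses[course].append((a.get_text(), int(age_text)))

print(dict(courses))
```

Converting the age with int() will raise if a student has no number after the name, which is probably what you want to notice on a page this messy.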
Upvotes: 0
Reputation: 2088
It would help if you could edit your question to say exactly what you're looking for. In the meantime, here's a basic example of how you could grab the data from this page.
from bs4 import BeautifulSoup
import re
html = '''<div class="overview">
<span class="course_titles">Courses:</span>
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
<a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
<a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>'''
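soup = BeautifulSoup(html, "html.parser")
all_links = soup.find_all('a')
dict_courseinfo = {}
dict_key = ''
stu_lst = []
for n, link in enumerate(all_links):
    if link.get('class')[0] == 'course_name':
        # A new course starts: store the students collected for the previous one.
        if n > 0:
            dict_courseinfo[dict_key] = stu_lst
        stu_lst = []
        dict_key = str(link.text)
    else:
        # The age is not inside any tag, so look it up in the raw HTML after the name.
        age = int(re.search(re.escape(link.text) + r"</a> (\d+)", html).group(1))
        stu_lst.append((str(link.text), age))
# Store the students of the last course.
dict_courseinfo[dict_key] = stu_lst
print(dict_courseinfo)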
Which will output:
{'Math101': [('Mark', 17), ('Alex', 18)], 'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)]}
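One caveat with looking the age up by name: re.search returns the first match in the whole page, so two students who share a name would both get the first one's age. A possible way around that is to let one regex walk the anchors in document order and capture the age together with the name. The sketch below is only that, a sketch: the pattern `ANCHOR`, the function `students_by_course`, and the second "Mark" (student 1500) in the sample are made up for illustration, and it assumes the class attribute is always double-quoted as in your example.

```python
import re

# Assumption: every anchor carries a double-quoted class attribute, as in the sample.
ANCHOR = re.compile(
    r'<a\s[^>]*class="(course_name|coursestudent_name)"[^>]*>([^<]+)</a>\s*(\d+)?'
)


def students_by_course(html):
    """Group (name, age) tuples under the course anchor that precedes them."""
    courses = {}
    current = None
    for kind, name, age in ANCHOR.findall(html):
        if kind == "course_name":
            current = name
            courses[current] = []
        elif current is not None:
            # age is an empty string when no digits follow the anchor.
            courses[current].append((name, int(age) if age else None))
    return courses


# Hypothetical sample with a repeated student name to show the difference.
sample = '''<div class="overview">
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1500/" class="coursestudent_name">Mark</a> 16 ),
</div>'''

print(students_by_course(sample))
```

Here each "Mark" keeps his own age because the number is captured from the text directly after his own anchor rather than searched for by name.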
Upvotes: 0
Reputation: 4341
You can do most of it with BeautifulSoup, then use a tiny bit of regex to get the student age, which isn't inside any HTML tags:
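import re
from bs4 import BeautifulSoup

# html is assumed to hold the markup from the question.
soup = BeautifulSoup(html, "html.parser")
allA = soup.find("div", {"class": "overview"}).find_all("a")
classInfo = {}
currentClass = None
for item in allA:
    if item['class'] == ['course_name']:
        # Start a new course entry and remember it for the students that follow.
        classInfo[item.text] = []
        currentClass = item.text
    else:
        # The age sits in plain text after the anchor, so search the raw HTML for it.
        age = int(re.search(re.escape(item.text) + r"</a> (\d+)", html).group(1))
        classInfo[currentClass].append((item.text, age))
print(classInfo)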
This outputs (the key order can differ between Python versions):
{'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)], 'Math101': [('Mark', 17), ('Alex', 18)]}
Upvotes: 1