joni
joni

Reputation: 57

Beautifulsoup extract multiple lines

I need some help and hope you can help me.

I have used mechanize to extact some data from a website. This has been processed to some output in a file. This file I would like to process some more, but here I've run into some problems.

The data looks like this:

    eek43"><a name="week43">Week 43</a></h2>
<div class="day"><h3 class="dayname">Monday</h3><div class="date">24/10/2016</div><div class="event" style="background-color: #58AA40"><a href="/course/view.php?id=16544">[E16] 1. sem / M1 - Psykiatri/psykologi</a><div class="teacher">Jane Doe</div><div class="time">Time: 08:15 - 12:00</div><div class="location">Location: KS5 lok. 47/49. GrpR:58,74,75,76,77,78,79,81,83</div><div class="note">Note: some notes</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Jannie Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: NJV 6A 1.50</div><div class="note">Note: Hold X2 some notes</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Jane Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: NJV 6A 1.50</div><div class="note">Note: Hold X2 - opsamling</div></div><div class="event" style="background-color: #58AA40"><a href="/course/view.php?id=16544">[E16] 1. sem / M1 - Psykiatri/psykologi</a><div class="teacher">Jannie Doe</div><div class="time">Time: 12:30 - 16:15</div><div class="location">Location: KS5 lok. 47/49.GrpR:58,74,75,76,77,78,79,81,83</div><div class="note">Note: some notes</div></div></div>
<div class="day"><h3 class="dayname">Tuesday</h3><div class="date">25/10/2016</div><div class="event" style="background-color: #5858FA"><a href="/course/view.php?id=16538">[E16] 1. sem / M1 - Socialt arbejde</a><div class="teacher">John Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 15. aud. B</div><div class="note">Note: Hold X&Y - Opsamling af profession og socialrådgiv</div></div><div class="event" style="background-color: #58AA40"><a href="/course/view.php?id=16544">[E16] 1. sem / M1 - Psykiatri/psykologi</a><div class="teacher">Jannie Doe</div><div class="time">Time: 10:15 - 14:15</div><div class="location">Location: NJV 8A, lok. 1.12 AUD</div><div class="note">Note: Hold X&Y - Perspektiver på psykiske lidelser...</div></div></div>
<div class="day"><h3 class="dayname">Wednesday</h3><div class="date">26/10/2016</div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">James Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: NJV 6A 1.50A</div><div class="note">Note: Hold Y1 - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">James Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: NJV 6A 1.50A</div><div class="note">Note: Hold Y2 - opsamling</div></div></div>
<div class="day"><h3 class="dayname">Thursday</h3><div class="date">27/10/2016</div></div>
<div class="day"><h3 class="dayname">Friday</h3><div class="date">28/10/2016</div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Johnny Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.053</div><div class="note">Note: Hold Y1a -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Lisa Andersen</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.047</div><div class="note">Note: Hold X1a - øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">John Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.049</div><div class="note">Note: Hold X2a -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Janine Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.055</div><div class="note">Note: Hold Y2a -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Jamie Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.047</div><div class="note">Note: Hold X1b -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">James Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.055</div><div class="note">Note: Hold Y2b -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Johnny Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.053</div><div class="note">Note: Hold Y1b -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">John Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.049</div><div class="note">Note: Hold X2b -  øvelser - opsamling</div></div></div>
<div class="day"><h3 class="dayname">Saturday</h3><div class="date">29/10/2016</div></div>
<div class="day"><h3 class="dayname">Sunday</h3><div class="date">30/10/2016</div></div>
<h2 class="week" id="

Ultimately I would like to make an output like this (all appointments who have a "note" containing X2 or X2a (not e.g. Y1)):

Monday  24/10/2016 
[E16] 1. sem / M1 - Psykiatri/psykologi     Jane Doe    Time: 08:15 - 12:00     Location: KS5 lok. 47/49.   Note: Hold X2 some notes 
[E16] 1. sem / M1 - Jura                    Jannie Doe  Time: 08:15 - 10:00     Location: NJV 6A 1.50       Note: Hold X2a some notes
[E16] 1. sem / M1 - Jura                    Jane Do     Time: 10:15 - 12:00     Location: NJV 6A 1.50       Note: Hold X2 - opsamling
...

Tuesday 25/10/2016
...

However if I run my code I only receive the first line:

[(u'Monday', u'24/10/2016', u'Jane Doe', u'Time: 08:15 - 12:00', u'Note: Hold X2 some notes'), (u'Monday', u'24/10/2016', u'Jane Doe', u'Time: 08:15 - 12:00', u'Note: Hold X2 some notes'), (u'Monday', u'24/10/2016', u'Jane Doe', u'Time: 08:15 - 12:00', u'Note: Hold X2 some notes'),...

Some of code:

data = [] 
soup = BeautifulSoup(open('scrape_out.txt'))

for lines in soup :  
    date = soup.find('div', attrs={'class': 'date'}).text.strip()
    day = soup.find('h3', attrs={'class': 'dayname'}).text.strip()
    teacher = soup.find('div', attrs={'class': 'teacher'}).text.strip()
    #lecture = soup.find('div', attrs={'a': })
    time = soup.find('div', attrs={'class': 'time'}).text.strip()
    location = soup.find('div', attrs={'class': 'location'}).text.strip()
    note = soup.find('div', attrs={'class': 'note'}).text.strip()

    data.append((day, date, teacher, time, note))

print data

I've tried a lot of different loops/iterations etc., but I only get this output (same line continues over and over again):

Anyone who are able to point me in the right direction (where I scrw up :) )

Thank you in advance.

Upvotes: 2

Views: 466

Answers (1)

Padraic Cunningham
Padraic Cunningham

Reputation: 180481

You need to iterate over the days:

h = """<div><h2 class="week43"><a name="week43">Week 43</a></h2>
<div class="day"><h3 class="dayname">Monday</h3><div class="date">24/10/2016</div><div class="event" style="background-color: #58AA40"><a href="/course/view.php?id=16544">[E16] 1. sem / M1 - Psykiatri/psykologi</a><div class="teacher">Jane Doe</div><div class="time">Time: 08:15 - 12:00</div><div class="location">Location: KS5 lok. 47/49. GrpR:58,74,75,76,77,78,79,81,83</div><div class="note">Note: some notes</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Jannie Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: NJV 6A 1.50</div><div class="note">Note: Hold X2 some notes</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Jane Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: NJV 6A 1.50</div><div class="note">Note: Hold X2 - opsamling</div></div><div class="event" style="background-color: #58AA40"><a href="/course/view.php?id=16544">[E16] 1. sem / M1 - Psykiatri/psykologi</a><div class="teacher">Jannie Doe</div><div class="time">Time: 12:30 - 16:15</div><div class="location">Location: KS5 lok. 47/49.GrpR:58,74,75,76,77,78,79,81,83</div><div class="note">Note: some notes</div></div></div>
<div class="day"><h3 class="dayname">Tuesday</h3><div class="date">25/10/2016</div><div class="event" style="background-color: #5858FA"><a href="/course/view.php?id=16538">[E16] 1. sem / M1 - Socialt arbejde</a><div class="teacher">John Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 15. aud. B</div><div class="note">Note: Hold X&Y - Opsamling af profession og socialrådgiv</div></div><div class="event" style="background-color: #58AA40"><a href="/course/view.php?id=16544">[E16] 1. sem / M1 - Psykiatri/psykologi</a><div class="teacher">Jannie Doe</div><div class="time">Time: 10:15 - 14:15</div><div class="location">Location: NJV 8A, lok. 1.12 AUD</div><div class="note">Note: Hold X&Y - Perspektiver på psykiske lidelser...</div></div></div>
<div class="day"><h3 class="dayname">Wednesday</h3><div class="date">26/10/2016</div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">James Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: NJV 6A 1.50A</div><div class="note">Note: Hold Y1 - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">James Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: NJV 6A 1.50A</div><div class="note">Note: Hold Y2 - opsamling</div></div></div>
<div class="day"><h3 class="dayname">Thursday</h3><div class="date">27/10/2016</div></div>
<div class="day"><h3 class="dayname">Friday</h3><div class="date">28/10/2016</div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Johnny Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.053</div><div class="note">Note: Hold Y1a -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Lisa Andersen</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.047</div><div class="note">Note: Hold X1a - øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">John Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.049</div><div class="note">Note: Hold X2a -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Janine Doe</div><div class="time">Time: 08:15 - 10:00</div><div class="location">Location: Fib 13.055</div><div class="note">Note: Hold Y2a -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Jamie Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.047</div><div class="note">Note: Hold X1b -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">James Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.055</div><div class="note">Note: Hold Y2b -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">Johnny Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.053</div><div class="note">Note: Hold Y1b -  øvelser - opsamling</div></div><div class="event" style="background-color: #ACFA58"><a href="/course/view.php?id=16533">[E16] 1. sem / M1 - Jura</a><div class="teacher">John Doe</div><div class="time">Time: 10:15 - 12:00</div><div class="location">Location: Fib 13.049</div><div class="note">Note: Hold X2b -  øvelser - opsamling</div></div></div>
<div class="day"><h3 class="dayname">Saturday</h3><div class="date">29/10/2016</div></div>
<div class="day"><h3 class="dayname">Sunday</h3><div class="date">30/10/2016</div></div>
</div>
"""
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(h, "lxml")
for d in soup.find_all("div", class_="day"):
    notes = d.find_all("div", class_="note")
    teachers = d.find_all("div",class_="teacher")
    date = d.find("div", class_="date")
    times = d.find_all("div", class_="time")
    day = d.find("h3",class_="dayname")
    for note,time,   teacher in zip(notes,times,  teachers):
        note_text = note.text
        if "X2" in note_text:
            print((day.text, date.text, teacher.text,time.text, note.text))

Which will give you:

('Monday', '24/10/2016', 'Jannie Doe', 'Time: 08:15 - 10:00', 'Note: Hold X2 some notes')
('Monday', '24/10/2016', 'Jane Doe', 'Time: 10:15 - 12:00', 'Note: Hold X2 - opsamling')
('Friday', '28/10/2016', 'John Doe', 'Time: 08:15 - 10:00', 'Note: Hold X2a -  øvelser - opsamling')
('Friday', '28/10/2016', 'John Doe', 'Time: 10:15 - 12:00', 'Note: Hold X2b -  øvelser - opsamling')

If you want to group each by a week you need to add a find_all call looking for whatever the parent element is that contains all the weeks.

To write to a file, you can use the csv lib:

from csv import writer

with open("data.csv", "w") as f:
    wr = csv.writer(f)
    # write column names
    wr.writerow(["Day", "Date", "Teacher", "Note"])
    for d in soup.find_all("div", class_="day"):
        notes = d.find_all("div", class_="note")
        teachers = d.find_all("div",class_="teacher")
        date = d.find("div", class_="date")
        times = d.find_all("div", class_="time")
        day = d.find("h3",class_="dayname")
        for note,time,   teacher in zip(notes,times,  teachers):
            note_text = note.text
            if "X2" in note_text:
               # write each group on new row
                wr.writerow((day.text, date.text, teacher.text,time.text, note.text))

Upvotes: 1

Related Questions