Reputation: 1402
Basically, I have a large HTML document that I would like to scrape. A very simplified example of a similar document is as follows:
<a name = 'ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>
<a name = 'ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>
<a name = 'ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
The document has roughly 3500 'a' tags, and at first I thought each would have the same layout. So I wrote something along the lines of:
a_list = soup.find_all('a')
data2D = []
for i in range(0, len(a_list)):
    data = []
    data.append(a_list[i]['name'])
    data.append(a_list[i].find_next(text='Date').find_next().text)
    data.append(a_list[i].find_next(text='Source').find_next().text)
    data.append(a_list[i].find_next(text='Author').find_next().text)
    data2D.append(data)
However, since some IDs are missing an Author or a Date, the scraper grabs the next available Author or Date, which belongs to the next ID: ID_1 would get ID_2's Author, and ID_2 would get ID_3's Date. My first thought was to somehow keep track of the index of each tag and append null whenever a value's index exceeds that of the next 'a' tag. Is there a better solution?
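For reference, here is a rough sketch of that index-tracking idea (untested against the real document, and probably clumsier than it needs to be):

all_tags = soup.find_all(True)                        # every tag, in document order
pos = {id(tag): i for i, tag in enumerate(all_tags)}  # tag -> document position

a_list = soup.find_all('a')
data2D = []
for i, a in enumerate(a_list):
    # everything before the next 'a' tag belongs to this ID
    limit = pos[id(a_list[i + 1])] if i + 1 < len(a_list) else len(all_tags)
    data = [a['name']]
    for label in ('Date', 'Source', 'Author'):
        hit = a.find_next(text=label)
        value = hit.find_next() if hit else None
        # keep the value only if it is still inside this ID's block
        data.append(value.text if value is not None and pos[id(value)] < limit else None)
    data2D.append(data)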
Upvotes: 2
Views: 114
Reputation: 474221
Instead of find_next(), I would use .find_next_siblings() (or .find_all_next()) and get all the tags until the next a link or the end of the document. Something along these lines:
links = soup.find_all('a', {"name": True})
data = []
columns = set(['Date', 'Source', 'Author'])
for link in links:
item = [link["name"]]
for elm in link.find_next_siblings():
if elm.name == "a":
break # hit the next "a" element - break
if elm.text in columns:
item.append(elm.find_next().text)
data.append(item)
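Note that with this approach a record that is missing a field simply produces a shorter row. If you need fixed-width rows with None standing in for missing values, a small variation (a sketch, reusing the same soup) collects the values into a dict first:

links = soup.find_all('a', {"name": True})
columns = ['Date', 'Source', 'Author']

data = []
for link in links:
    found = {}
    for elm in link.find_next_siblings():
        if elm.name == "a":
            break  # reached the next record
        if elm.text in columns:
            found[elm.text] = elm.find_next().text
    # fixed-width row; found.get() returns None for absent fields
    data.append([link["name"]] + [found.get(col) for col in columns])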
Upvotes: 1