Reputation: 1402
Basically, I have a large HTML document that I would like to scrape. A very simplified example of a similar document is as follows:
<a name = 'ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>
<a name = 'ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>
<a name = 'ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
The document has roughly 3500 'a' tags, and at first I thought each would have the same layout. So I wrote something along the lines of:
a_list = soup.find_all('a')
data2D = []
for i in range(0, len(a_list)):
    data = []
    data.append(a_list[i]['name'])
    data.append(a_list[i].find_next(text='Date').find_next().text)
    data.append(a_list[i].find_next(text='Source').find_next().text)
    data.append(a_list[i].find_next(text='Author').find_next().text)
    data2D.append(data)
However, since some IDs are missing an Author or a Date, the scraper grabs the next available Author or Date, which belongs to the next ID: ID_1 would get ID_2's Author, and ID_2 would get ID_3's Date. My first thought was to somehow keep track of the index of each tag and append null whenever a value's index exceeds that of the next 'a' tag. Is there a better solution?
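For reference, here is a rough sketch of that index-tracking idea (untested against the real document, and probably clumsier than it needs to be):

all_tags = soup.find_all(True)                        # every tag, in document order
pos = {id(tag): i for i, tag in enumerate(all_tags)}  # tag -> document position

a_list = soup.find_all('a')
data2D = []
for i, a in enumerate(a_list):
    # everything before the next 'a' tag belongs to this ID
    limit = pos[id(a_list[i + 1])] if i + 1 < len(a_list) else len(all_tags)
    data = [a['name']]
    for label in ('Date', 'Source', 'Author'):
        hit = a.find_next(text=label)
        value = hit.find_next() if hit else None
        # keep the value only if it is still inside this ID's block
        data.append(value.text if value is not None and pos[id(value)] < limit else None)
    data2D.append(data)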
Upvotes: 2
Views: 114
Reputation: 474221
Instead of find_next(), I would use .find_next_siblings() (or .find_all_next()) and get all the tags until the next a link or the end of the document. Something along these lines:
links = soup.find_all('a', {"name": True})
data = []
columns = set(['Date', 'Source', 'Author'])
for link in links:
item = [link["name"]]
for elm in link.find_next_siblings():
if elm.name == "a":
break # hit the next "a" element - break
if elm.text in columns:
item.append(elm.find_next().text)
data.append(item)
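Note that with this approach a record that is missing a field simply produces a shorter row. If you need fixed-width rows with None standing in for missing values, a small variation (a sketch, reusing the same soup) collects the values into a dict first:

links = soup.find_all('a', {"name": True})
columns = ['Date', 'Source', 'Author']

data = []
for link in links:
    found = {}
    for elm in link.find_next_siblings():
        if elm.name == "a":
            break  # reached the next record
        if elm.text in columns:
            found[elm.text] = elm.find_next().text
    # fixed-width row; found.get() returns None for absent fields
    data.append([link["name"]] + [found.get(col) for col in columns])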
Upvotes: 1