Reputation: 4052
I am scraping the Presidential debate transcripts. I noticed that when my scraper pulls the HTML elements, it never pulls a paragraph end tag (</p>).
eg
Checking the source in browser
import urllib.request

url_to_scrape = 'http://www.presidency.ucsb.edu/ws/index.php?pid=119039'
req = urllib.request.Request(url_to_scrape)
resp = urllib.request.urlopen(req)
html = resp.read()  # response body as bytes, not yet decoded or parsed
I figure one of two things is going on:
How do I figure out which one it is, and then correct for it?
Upvotes: 1
Views: 41
Reputation: 2796
Can you check the actual response that Chrome received? In some circumstances, Chrome will detect and repair small omissions like this one in order to display the page, even if the missing tags are not in the response. My guess is that Chrome fixed this, and the actual source is bad.
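One way to test that guess without a browser is to count the tags in the raw bytes your scraper receives, before any parser touches them. This is only a rough sketch: `count_paragraph_tags` and `fetch_raw` are names I made up for illustration, and the regexes are a crude check, not a real HTML parser.

```python
import re
import urllib.request


def count_paragraph_tags(raw_html):
    """Return (opening <p> count, closing </p> count) found in raw bytes."""
    # Match "<p" only when followed by whitespace or ">", so <pre>, <param>, etc. are skipped.
    opens = len(re.findall(rb"<p(?=[\s>])", raw_html, flags=re.IGNORECASE))
    closes = len(re.findall(rb"</p\s*>", raw_html, flags=re.IGNORECASE))
    return opens, closes


def fetch_raw(url):
    """Hypothetical helper: download the response body as unmodified bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


# Example usage (needs network access, so it is left commented out):
# opens, closes = count_paragraph_tags(fetch_raw(url_to_scrape))
# print(opens, closes)
```

If the closing count comes back as zero (or far below the opening count), the server is most likely sending the page without the end tags, and what you see in the browser has been repaired for display. HTML permits omitting </p> in many cases, so a parser will usually still recover the paragraphs for you.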
Upvotes: 2