Urllib returning html but no closing paragraph tags

Question

I am scraping the Presidential debate transcripts. I noticed that when my scraper pulls the HTML elements, it never pulls a paragraph-end tag (</p>).

eg

Checking the source in browser

import urllib.request

url_to_scrape = 'http://www.presidency.ucsb.edu/ws/index.php?pid=119039'
req = urllib.request.Request(url_to_scrape)
resp = urllib.request.urlopen(req)
resp.read()  # raw bytes of the response body, exactly as the server sent them

I figure one of two things is going on:

1. urllib is somehow dropping closing tags (just for paragraphs; the rest are fine).
2. The raw source doesn't include closing tags, and the browser is filling them in.

How do I figure out which one it is, and then correct for it?

David Culbreth · Accepted Answer

Can you check the actual response that Chrome received (for example, in the Network tab of the developer tools) rather than the rendered element tree? In some circumstances, Chrome will detect and correct small omissions like this one in order to display the page, even if the closing tags are not in the response. My guess is that Chrome fixed this, and the actual source is missing them.
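
To tell the two cases apart without a browser, one option is to count the opening and closing paragraph tags in the raw text that urllib returned. If the closing count is zero (or far lower than the opening count), the source itself omits them, which HTML permits for </p> in many contexts. The sketch below also shows one way to work around it using only the standard library: treat each new <p> as the end of the previous paragraph. The names count_p_tags and ParagraphCollector and the sample string are made up for illustration; they are not part of urllib or of the site's markup.

```python
import re
from html.parser import HTMLParser


def count_p_tags(html):
    """Return (opening, closing) counts of <p> tags in raw HTML text."""
    opening = len(re.findall(r"<p\b[^>]*>", html, flags=re.IGNORECASE))
    closing = len(re.findall(r"</p\s*>", html, flags=re.IGNORECASE))
    return opening, closing


class ParagraphCollector(HTMLParser):
    """Collect paragraph text, treating each <p> as ending the previous one."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # None means we are not inside a paragraph

    def _flush(self):
        # Store the paragraph collected so far, if it has any text.
        if self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()        # an unclosed previous paragraph ends here
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()        # also works when </p> is present

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()            # the last paragraph may never be closed


# Hypothetical sample that mimics markup with omitted </p> tags.
sample = "<div><p>First speaker: hello.<p>Second speaker: <b>hi</b> there.</div>"

print(count_p_tags(sample))

collector = ParagraphCollector()
collector.feed(sample)
collector.close()
print(collector.paragraphs)
```

To run this against the real page, decode the bytes from resp.read() first (for example with .decode('utf-8', errors='replace'); the encoding is an assumption, so check the response headers) and pass the resulting string to both helpers. This is a simplification: text that follows a paragraph but sits outside any <p> is still attributed to that paragraph, so a full HTML parser may be a better fit for messier pages.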
