ScottieB

Reputation: 4052

Urllib returning html but no closing paragraph tags

I am scraping the Presidential debate transcripts. I noticed that when my scraper pulls the HTML, the result never contains a closing paragraph tag (</p>).

For example:

Checking the source in browser from Chrome's View > Developer > View source

import urllib.request

url_to_scrape = 'http://www.presidency.ucsb.edu/ws/index.php?pid=119039'
req = urllib.request.Request(url_to_scrape)
resp = urllib.request.urlopen(req)
resp.read()  # response body as bytes, not yet decoded or parsed

Python results

I figure one of two things is going on:

  1. urllib is somehow dropping closing tags (for just paragraphs, the rest are fine)
  2. The raw source doesn't include closing tags, and the browser is filling them in.

How do I figure out which one it is, and then correct for it?

Upvotes: 1

Views: 41

Answers (1)

David Culbreth

Reputation: 2796

Can you check the actual packet that Chrome received? In some circumstances, Chrome will detect and correct small omissions like this one in order to display the page, even if they're not in the packet. My guess is that Chrome fixed this, and the actual source is bad.
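One way to test this guess without a packet capture is to count the tags in the raw text yourself. The following is only a minimal sketch under stated assumptions: the sample string is made up for illustration (it is not the real page), and count_p_tags and ParagraphExtractor are hypothetical helper names, not part of urllib. In practice you would pass in the decoded output of resp.read(). Note that HTML allows the </p> end tag to be omitted in many cases, so a page with no closing paragraph tags can still be valid, and a parser can treat each new <p> as the end of the previous paragraph.

```python
import re
from html.parser import HTMLParser


def count_p_tags(html):
    """Return (opening, closing) counts of <p> tags in raw HTML text."""
    # Match <p> or <p attr=...>, but not <pre>, <param>, etc.
    opening = len(re.findall(r"<p(?:\s[^>]*)?>", html, flags=re.IGNORECASE))
    closing = len(re.findall(r"</p\s*>", html, flags=re.IGNORECASE))
    return opening, closing


class ParagraphExtractor(HTMLParser):
    """Collect paragraph text, treating each new <p> as ending the previous one."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text chunks of the paragraph being read

    def _flush(self):
        # Store the paragraph collected so far, if it has any text.
        if self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


# Hypothetical sample standing in for the fetched page.
sample = "<html><body><p>PARTICIPANTS:<p>First speaker.<p>Second speaker.</body></html>"
print(count_p_tags(sample))

extractor = ParagraphExtractor()
extractor.feed(sample)
extractor.close()
print(extractor.paragraphs)
```

If the closing count is zero on the real response, the server is sending the page that way and urllib is not dropping anything. To correct for it, split on opening tags as above, or hand the text to a lenient HTML parser instead of relying on </p> being present.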

Upvotes: 2
