Reputation: 149
I wrote a script that pulls paragraphs from articles and writes them to a file. For some articles, it won't pull every paragraph. This is where I am lost. Any guidance would be deeply appreciated. I have included a link to a particular article where it isn't pulling all of the information. It scrapes everything up until the first quoted sentence.
URL: http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306
import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")
# Open txt document for output
txt = open('ctp_output.txt', 'w')
# Parse HTML of article
soup = BeautifulSoup(urllib2.urlopen(url).read())
# retrieve all of the paragraph tags
tags = soup('p')
for tag in tags:
    txt.write(tag.get_text() + '\n' + '\n')
Upvotes: 1
Views: 1981
Reputation: 473903
This is what works for me:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"
# urlopen() returns a file-like object, so it can be passed straight to BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen(url))
with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        # encode to UTF-8 so non-ASCII characters (e.g. curly quotes) can be written
        f.write(tag.text.encode('utf-8') + '\n')
Note that you should use the with context manager while working with files. Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor, since urlopen() returns a file-like object.
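If the missing paragraphs turn out to be an encoding issue (the question notes the output stops at the first quoted sentence, which suggests non-ASCII quote characters), a variation worth trying is to open the output file with io.open and an explicit encoding, so the unicode text from get_text() can be written directly instead of encoding each paragraph by hand. This is only a sketch built on the code above; naming 'html.parser' explicitly is my own choice, not something the original code requires:

import io
import urllib2
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"

# Passing the parser name explicitly is just a preference here
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')

# io.open takes an encoding, so unicode text can be written without
# calling .encode('utf-8') on every paragraph
with io.open('ctp_output.txt', 'w', encoding='utf-8') as f:
    for tag in soup.find_all('p'):
        f.write(tag.get_text() + u'\n\n')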
Hope that helps.
Upvotes: 1