I am scraping a news website to collect news articles, using the following code:
import mechanize
from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2012/06/19/"
link_dictionary = {}
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)
for tag_li in soup.findAll('li', attrs={"data-section":"Editorial"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print articletext
driver.close()
I am getting the desired result, but I want each news article on a single line. For some articles the whole text comes out on one line, while for others it is split into separate paragraphs. Can someone help me sort this out? I am new to Python programming. Thanks and regards.
Upvotes: 4
Views: 84
Reputation: 2649
This is likely caused by how whitespace is laid out in that site's HTML: the text inside a "p" tag can itself contain newlines, so some articles print across several lines. Also, not every site uses "p" tags for its content. Your best bet is probably a regular-expression replace that collapses each run of whitespace (including newlines) into a single space.
At the beginning of your file, import the regular expression module:
import re
Then, after you've built your articletext, print it like this instead of printing it directly:
print re.sub(r'\s+', ' ', articletext).strip()
Since not every article keeps its text in "p" tags, you might also want to extract the text from other elements that hold content.
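As a rough sketch of the whitespace cleanup described above, the following standalone snippet joins paragraph strings with a space (so words from adjacent paragraphs don't run together) and then collapses every whitespace run. The function name to_single_line and the sample strings are made up for illustration; in the question's code the list would hold the text of each "p" tag. It uses print() with a single argument so it should behave the same on Python 2 and 3.

```python
import re

def to_single_line(paragraphs):
    """Join paragraph strings and collapse every whitespace run into one space."""
    joined = " ".join(paragraphs)
    # \s+ matches spaces, tabs and newlines alike, so no flags are needed.
    return re.sub(r"\s+", " ", joined).strip()

# Hypothetical sample standing in for the text of each "p" tag.
sample = ["First paragraph,\nsplit over lines. ", "\tSecond   paragraph."]
print(to_single_line(sample))
# prints: First paragraph, split over lines. Second paragraph.
```

In the scraping loop, this would mean collecting tag.text into a list rather than appending to a string, and calling the helper once per article.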
Upvotes: 1