Don't Want Spaces Between Paragraphs: Python

Question

I am scraping a news website to collect news articles, using the following code:

import mechanize
from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2012/06/19/"

link_dictionary = {}
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)

for tag_li in soup.findAll('li', attrs={"data-section":"Editorial"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]

        brnew =  mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()

        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print articletext


driver.close()

I am getting the desired result, but I want each news article on a single line. For some articles I get the whole article on one line, while for others the paragraphs come out on separate lines. Can someone help me sort this out? I am new to Python programming. Thanks and regards.

apg · Accepted Answer

This is likely related to how whitespace is handled in the particular site's HTML, and to the fact that not all sites use "p" tags for their content. Your best bet is probably a regular expression replace that collapses the extra whitespace (including newlines) into single spaces.

At the beginning of your file, import the regular expression module:

import re

Then after you've built your articletext, add the following code:

print re.sub(r'\s+', ' ', articletext)
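
To illustrate what that substitution does, here is a minimal sketch on a made-up string. The helper name collapse_whitespace and the sample text are hypothetical, chosen only for illustration, and the trailing strip() is an extra step not in the line above. It is written so that it should run under both Python 2.7 and Python 3.

```python
import re

def collapse_whitespace(text):
    # Replace every run of spaces, tabs and newlines with one space,
    # then trim the ends so the result is a single clean line.
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample standing in for articletext.
sample = "First paragraph.\n\n  Second paragraph,\twith a tab.\n"
one_line = collapse_whitespace(sample)
print(one_line)
```

Because \s already matches newlines, no extra flag is needed for the pattern to work across lines.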

You might also want to extract text from other elements that hold article content, since not every page wraps its text in "p" tags.
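
As a rough sketch of collecting text from more than one kind of element, the example below uses the standard library's HTML parser instead of BeautifulSoup, purely so that it runs without extra installs. The class TextCollector, the function extract_text, the tag list and the sample markup are all assumptions made for illustration; which tags actually hold the article text depends on the site. With BeautifulSoup, passing a list of tag names to findAll is the comparable approach.

```python
import re
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class TextCollector(HTMLParser):
    """Collect text that appears inside any of the wanted tags."""

    def __init__(self, wanted):
        HTMLParser.__init__(self)
        self.wanted = set(wanted)
        self.depth = 0    # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one wanted tag is open.
        if self.depth > 0:
            self.chunks.append(data)

def extract_text(html, wanted=("p", "div", "blockquote")):
    # The tag list is an assumption; adjust it to the site being scraped.
    # Limitation: this sketch assumes every wanted tag is explicitly closed.
    parser = TextCollector(wanted)
    parser.feed(html)
    parser.close()
    return re.sub(r'\s+', ' ', ' '.join(parser.chunks)).strip()

# Hypothetical markup standing in for a downloaded article page.
sample_html = ("<div>Intro text\n<p>First  paragraph.</p></div>"
               "<span>skip me</span><blockquote>A quote.</blockquote>")
print(extract_text(sample_html))
```

Joining the pieces with a space before collapsing whitespace keeps the last word of one element from running into the first word of the next.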
