user2982049

Don't want spaces between paragraphs: Python

I am scraping a news website to get news articles, using the following code:

import mechanize
from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2012/06/19/"

link_dictionary = {}
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)

for tag_li in soup.findAll('li', attrs={"data-section":"Editorial"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]

        brnew =  mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()

        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print articletext


driver.close()

I am getting the desired result, but I want each news article on a single line. For some articles I get the whole article on one line, while for others the paragraphs come out on separate lines. Can someone help me sort out the issue? I am new to Python programming. Thanks and regards.

Upvotes: 4

Views: 84

Answers (1)

apg

Reputation: 2649

This is likely related to how whitespace is handled in the particular site's HTML, and to the fact that not all sites use "p" tags for their content. Your best bet is probably a regular expression replace that collapses each run of extra whitespace (including newlines) into a single space.

At the beginning of your file, import the regular expression module:

import re

Then after you've built your articletext, add the following code:

print re.sub(r'\s+', ' ', articletext)
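As a minimal sketch of the same idea in a reusable form: the helper name `collapse_whitespace` and the sample text are made up for illustration, and the snippet uses `print()` with parentheses so it also runs on Python 3 (the question's code is Python 2). No flags are passed, because `\s` already matches newlines.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical article text with paragraph breaks, as a scraped page might contain.
articletext = "First paragraph.\n\nSecond   paragraph,\r\n\twith a tab.\n"
print(collapse_whitespace(articletext))
# -> First paragraph. Second paragraph, with a tab.
```

The trailing `.strip()` is optional; it only removes the single space that a leading or trailing newline would otherwise leave behind.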

You might also want to extract the text from other elements that hold article content, since not every page wraps its text in "p" tags.
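One way to sketch collecting text from several kinds of elements is shown here. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup (with BeautifulSoup, passing a list of tag names to `findAll` covers the same ground). The tag set `TEXT_TAGS`, the class name, and the sample HTML are all assumptions for illustration, and the sketch assumes each of those tags is explicitly closed. It targets Python 3.

```python
import re
from html.parser import HTMLParser

# Hypothetical set of tags assumed to hold article text; adjust per site.
TEXT_TAGS = {"p", "h1", "h2", "blockquote"}

class ArticleTextParser(HTMLParser):
    """Collect text found inside any element whose tag is in TEXT_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of TEXT_TAGS elements currently open
        self.chunks = []  # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self.depth > 0:
            self.depth -= 1
            # Separate this element's text from whatever follows it.
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

def extract_article_text(html):
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse every whitespace run so the article ends up on one line.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()

sample = ("<h1>Title</h1>\n<p>First\n <b>bold</b> paragraph.</p>"
          "<div><blockquote>Quoted  text.</blockquote></div>"
          "<script>ignored()</script>")
print(extract_article_text(sample))
```

Appending a space when a text element closes keeps adjacent paragraphs from running together, while inline markup such as `<b>` inside a paragraph does not introduce extra spaces.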

Upvotes: 1
