Crawler with BeautifulSoup

Question

I am trying to create a web crawler for student research. I have already finished it, but I would like you to tell me whether my approach is the best one (it probably isn't :p).

The crawler is for the CNN site, and the only thing I want to get is the text of the news articles.

Here is an example link: link

Here is my code:

import urllib2
from bs4 import BeautifulSoup


def cnn_crawler(link):
    # Fetch the page with a browser-like User-Agent header.
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    usock = urllib2.urlopen(req)
    # Fall back to UTF-8 when the server does not declare a charset.
    encoding = usock.headers.getparam('charset') or 'utf-8'
    page = usock.read().decode(encoding)
    usock.close()

    soup = BeautifulSoup(page, 'html.parser')
    div = soup.find('div', attrs={'class': 'cnn_strycntntlft'})
    # Keep every paragraph of the story body except the footer text.
    text = [p for p in div.find_all('p')
            if 'cnn_strycbftrtxt' not in p.get('class', [])]
    return " ".join(entry.get_text() for entry in text)

Gunjan · Accepted Answer

You can try the Goose package if you only need text extraction:

https://github.com/grangier/python-goose

Here is the link. It works perfectly if you just need the text.
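For reference, a minimal sketch of how Goose could be used for this, based on the usage shown in the project's README (`Goose().extract(url=...)` and the `cleaned_text` attribute). The function name `extract_article_text` and its `extractor` parameter are my own, added only for illustration, so treat this as a starting point rather than a finished crawler:

```python
def extract_article_text(url, extractor=None):
    """Return the main article text for `url`.

    `extractor` is a hypothetical hook (not part of Goose) that lets you
    pass in an existing Goose instance instead of creating a new one.
    """
    if extractor is None:
        # Imported here so the function can be defined without Goose installed.
        # Assumes python-goose (Python 2); the goose3 fork for Python 3
        # uses `from goose3 import Goose` instead.
        from goose import Goose
        extractor = Goose()
    article = extractor.extract(url=url)
    # cleaned_text holds the article body with markup and boilerplate removed.
    return article.cleaned_text
```

You would call it as `extract_article_text(link)` with the same link you pass to `cnn_crawler`. Unlike the BeautifulSoup version, it does not depend on CNN-specific class names.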
