Reputation: 21
My basic question is: how does the newspaper package in Python determine which URLs/articles it returns? One would think it simply returns all of the article links found on the URL you provide, but it doesn't seem to work that way. As an example, if you use "http://www.cnn.com" and "https://www.cnn.com/politics", you get back the exact same articles. I would think the latter should only return articles from the politics page, but that does not seem to be the case.
So what is it actually doing? Is it just getting all of the articles from the homepage?
Here's an example I used to test this (I used Python 3.6.2):
import newspaper

# Build a newspaper source from the CNN homepage
url = "http://www.cnn.com"
paper = newspaper.build(url, memoize_articles=False)

article_list = []
for article in paper.articles:
    article_list.append(article.url)

# Build a newspaper source from the CNN politics page
url = "https://www.cnn.com/politics"
paper = newspaper.build(url, memoize_articles=False)

article_list_2 = []
for article in paper.articles:
    article_list_2.append(article.url)

# Print the total number of URLs returned by each build
print(len(article_list))
print(len(article_list_2))
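For reference, one quick way to confirm that the two builds really return the same articles is to compare the two URL lists as sets (a small sketch that continues from the script above, not part of my original test):

# Identical sets mean the politics URL added nothing beyond the homepage build
same = set(article_list) == set(article_list_2)
only_in_politics = set(article_list_2) - set(article_list)
print("Identical article sets:", same)
print("URLs unique to the politics build:", len(only_in_politics))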
Upvotes: 1
Views: 502
Reputation: 388
The Python newspaper package for article scraping and curation only returns articles from the home page.
import newspaper

# Build a source on the NY Post homepage and list every article URL it found
news_paper = newspaper.build('http://nypost.com', memoize_articles=False)
print(news_paper.size())

for article in news_paper.articles:
    print(article.url)
It will print all of the article URLs from the home page. I also tested it with CNN at 'https://edition.cnn.com'.
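If you only need articles from a particular section such as politics, one possible workaround (a sketch based on this behaviour, not something the newspaper library does for you; the '/politics' substring check is an assumption about CNN's URL layout) is to build on the homepage and filter the collected URLs by their path:

import newspaper

# Build on the homepage, then keep only URLs whose path contains '/politics'
paper = newspaper.build('https://www.cnn.com', memoize_articles=False)
politics_urls = [a.url for a in paper.articles if '/politics' in a.url]

for url in politics_urls:
    print(url)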
Upvotes: 2