Reputation: 21
My basic question is: how does the newspaper package in Python determine which URLs/articles it returns? One would think it simply returns all of the article links found on the URL you provide, but it doesn't seem to work that way. As an example, if you use "http://www.cnn.com" and "https://www.cnn.com/politics", you get back the exact same articles. I would think the latter should only return articles from the politics page, but that does not seem to be the case.
So what is it actually doing? Is it just getting all of the articles from the homepage?
Here's an example I used to test this (I used Python 3.6.2):
import newspaper

# Build a newspaper source from the CNN homepage
url = "http://www.cnn.com"
paper = newspaper.build(url, memoize_articles=False)

article_list = []
for article in paper.articles:
    article_list.append(article.url)

# Build a newspaper source from the CNN politics page
url = "https://www.cnn.com/politics"
paper = newspaper.build(url, memoize_articles=False)

article_list_2 = []
for article in paper.articles:
    article_list_2.append(article.url)

# Print the total number of URLs returned by each build
print(len(article_list))
print(len(article_list_2))
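For reference, one quick way to confirm that the two builds really return the same articles is to compare the two URL lists as sets (a small sketch that continues from the script above, not part of my original test):

# Identical sets mean the politics URL added nothing beyond the homepage build
same = set(article_list) == set(article_list_2)
only_in_politics = set(article_list_2) - set(article_list)
print("Identical article sets:", same)
print("URLs unique to the politics build:", len(only_in_politics))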
Upvotes: 1
Views: 502
Reputation: 388
The Python newspaper package for article scraping and curation only returns articles from the home page.
import newspaper

# Build a source on the NY Post homepage and list every article URL it found
news_paper = newspaper.build('http://nypost.com', memoize_articles=False)
print(news_paper.size())

for article in news_paper.articles:
    print(article.url)
It will print all of the article URLs from the home page. I also tested it with CNN at 'https://edition.cnn.com'.
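If you only need articles from a particular section such as politics, one possible workaround (a sketch based on this behaviour, not something the newspaper library does for you; the '/politics' substring check is an assumption about CNN's URL layout) is to build on the homepage and filter the collected URLs by their path:

import newspaper

# Build on the homepage, then keep only URLs whose path contains '/politics'
paper = newspaper.build('https://www.cnn.com', memoize_articles=False)
politics_urls = [a.url for a in paper.articles if '/politics' in a.url]

for url in politics_urls:
    print(url)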
Upvotes: 2