Naman
Naman

Reputation: 179

How to access cached articles in newspaper3k

Newspaper is a fantastic library that allows scraping web data however I am a little confused with article caching. It caches the article to speed up operations but how do I access those articles?

I have something like this. Now when I run this command twice with the same set of articles, I get the return type None the second time. How do I access those previously cached articles for processing?

newspaper_articles = [Article(url) for url in links]

Upvotes: 4

Views: 1769

Answers (2)

faisal burhanudin
faisal burhanudin

Reputation: 1160

After check from source code, It depends.

https://github.com/codelucas/newspaper/blob/beacce0e167349374ce0b37012b01c7c07a26890/newspaper/settings.py#L35

DATA_DIRECTORY = '.newspaper_scraper'

TOP_DIRECTORY = os.path.join(tempfile.gettempdir(), DATA_DIRECTORY)

so run this in your python interpreter to get locations of cache

import tempfile
tempfile.gettempdir()

Upvotes: 0

Looking at this: https://github.com/codelucas/newspaper/issues/481 it seems the caching method 'cache_disk' in https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py may have a bug. It indeed does cache the results to disk (search for a folder '.newspaper_scraper'), but doesn't load them afterwards.

A workaround is to set memoize_articles=False when building your newspaper, or using the Config class.

newspaper.build(url, memoize_articles=False)

Upvotes: 1

Related Questions