Reputation: 179
Newspaper is a fantastic library for scraping web data, but I am a little confused about article caching. It caches articles to speed up operations, but how do I access those cached articles?
I have something like this. When I run it twice with the same set of articles, I get `None`
the second time. How do I access the previously cached articles for processing?
from newspaper import Article

newspaper_articles = [Article(url) for url in links]
Upvotes: 4
Views: 1769
Reputation: 1160
After checking the source code: it depends. The cache location is defined as
DATA_DIRECTORY = '.newspaper_scraper'
TOP_DIRECTORY = os.path.join(tempfile.gettempdir(), DATA_DIRECTORY)
so run this in your Python interpreter to get the location of the cache:
import tempfile
tempfile.gettempdir()
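Putting the two pieces together, you can build the full cache path yourself and inspect what was stored. This is a minimal sketch based on the `DATA_DIRECTORY` constant shown above; the folder only exists once newspaper has actually cached something.

```python
import os
import tempfile

# Folder name used by newspaper for its on-disk cache (from its source)
DATA_DIRECTORY = '.newspaper_scraper'

# Same path newspaper computes: <system temp dir>/.newspaper_scraper
cache_dir = os.path.join(tempfile.gettempdir(), DATA_DIRECTORY)
print(cache_dir)

# List cached entries, if the cache folder has been created yet
if os.path.isdir(cache_dir):
    print(os.listdir(cache_dir))
```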
Upvotes: 0
Reputation: 253
Looking at this: https://github.com/codelucas/newspaper/issues/481 it seems the caching method `cache_disk` in https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py may have a bug. It does cache the results to disk (look for a folder named '.newspaper_scraper'), but never loads them back afterwards.
A workaround is to set memoize_articles=False when building your newspaper, or to use the Config class.
newspaper.build(url, memoize_articles=False)
Upvotes: 1