Reputation: 2147
I am doing some web scraping with the mechanize browser, using the code below. In some cases we keep getting the same page even though the remote page has already changed. So my question is: does the mechanize browser cache pages?
If so, how can we change that, or is there a way to avoid caching (apart from creating a new browser instance on every iteration of the scraping loop)?
# fill in the login details and submit; returns a mechanize.Browser instance
browser = _login()
# main loop
while True:
    rsp = browser.open(URL)
    html = rsp.read()
Thanks.
Upvotes: 0
Views: 1424
Reputation: 914
According to this thread,
Mechanize instances do cache pages you've visited, but you can clear that with agent.history.clear; or prevent history from being saved by setting agent.history.max_size = 0. Or, you can use a new Mechanize instance altogether.
Particularly,
Currently Mechanize reuses pages in the history of the session if a request with an If-Modified-Since header results in 304 Not Modified.
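The calls quoted above (agent.history.clear, agent.history.max_size) appear to be written for the Ruby library. In Python mechanize the closest counterpart I know of is Browser.clear_history(). Below is a minimal sketch of how it could be used inside the scraping loop; fetch_fresh is a hypothetical helper name, and whether clearing the history actually cures the stale page in your case is an assumption you would need to check.

```python
def fetch_fresh(browser, url):
    """Open url, return its body, then drop the browser's stored history.

    `browser` is assumed to behave like a mechanize.Browser, i.e. it has
    open(url) returning a response with read(), and clear_history().
    """
    response = browser.open(url)
    html = response.read()
    # Discard previously visited pages so they cannot be reused
    # and do not pile up in a long-running loop.
    browser.clear_history()
    return html
```

In the loop from the question this would replace the two lines in the body with html = fetch_fresh(browser, URL).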
And according to the documentation here, in Python the following code should prevent the caching-like behavior (seekable responses):
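import mechanize
ua = mechanize.UserAgent()
# do not keep response data around for re-reading (seek)
ua.set_seekable_responses(False)
# do not treat <meta http-equiv> tags as HTTP headers
ua.set_handle_equiv(False)
# do not log response bodies for debugging
ua.set_debug_responses(False)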
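If the stale page turns out to come from an HTTP cache or proxy between you and the server rather than from mechanize itself, another thing to try is asking for a fresh copy via request headers. This is only a sketch under that assumption: add_no_cache_headers is a hypothetical helper, and it relies on the browser exposing an addheaders list of (name, value) tuples, as mechanize.Browser does.

```python
# Headers that ask caches along the way to revalidate instead of
# serving a stored copy.
NO_CACHE_HEADERS = [("Cache-Control", "no-cache"), ("Pragma", "no-cache")]

def add_no_cache_headers(browser):
    """Append no-cache request headers to browser.addheaders, once each."""
    existing = {name.lower() for name, _ in browser.addheaders}
    for name, value in NO_CACHE_HEADERS:
        if name.lower() not in existing:
            browser.addheaders.append((name, value))
    return browser
```

You would call add_no_cache_headers(browser) once right after _login(), before entering the main loop. A server or proxy is free to ignore these headers, so treat this as something to try, not a guarantee.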
Hope that provides some insight.
Upvotes: 3