John

Reputation: 2147

Does the browser instance from mechanize cache?

I am doing some web scraping with the mechanize browser, using the code below. In some cases I keep getting the same page even though the remote page has already changed. So my questions are:

  1. Does a mechanize Browser instance cache pages by default (or under some configuration)?
  2. If so, how can that be changed, or is there a way to avoid caching (apart from creating a new browser instance on every iteration of the scraping loop)?

    # fill in the login details and submit; returns a mechanize.Browser instance
    browser = _login()
    # main loop
    while True:
        rsp = browser.open(URL)
        html = rsp.read()

thanks

Upvotes: 0

Views: 1424

Answers (1)

interpolack

Reputation: 914

According to this thread,

Mechanize instances do cache pages you've visited, but you can clear that with agent.history.clear; or prevent history from being saved by setting agent.history.max_size = 0. Or, you can use a new Mechanize instance altogether.

Particularly,

Currently Mechanize reuses pages in the history of the session if a request with an If-Modified-Since header results in 304 Not Modified.

And according to the documentation, in Python the following code should prevent the caching-like behavior (seekable responses):

    import mechanize
    ua = mechanize.UserAgent()
    ua.set_seekable_responses(False)
    ua.set_handle_equiv(False)
    ua.set_debug_responses(False)
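
If the stale pages come from a conditional request or an intermediate cache rather than from mechanize itself, one library-independent workaround is to ask for a fresh copy on every request. The following is only a minimal sketch using the standard library, not something taken from the thread or the mechanize documentation; the helper name `fresh_request` and the `_ts` query parameter are made up for illustration, and whether the server honors the headers is an assumption.

```python
import time
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fresh_request(url, bust_param="_ts"):
    """Build a Request that asks caches for a fresh copy of `url`.

    `bust_param` is a hypothetical query parameter name; it only has to be
    something the remote site ignores.
    """
    parts = urlsplit(url)
    # Keep the existing query parameters and append a changing timestamp,
    # so each request targets a URL no cache has seen before.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((bust_param, str(int(time.time() * 1000))))
    busted_url = urlunsplit(parts._replace(query=urlencode(query)))
    # Standard request headers that tell caches to revalidate.
    headers = {"Cache-Control": "no-cache", "Pragma": "no-cache"}
    return urllib.request.Request(busted_url, headers=headers)


req = fresh_request("http://example.com/page?a=1")
print(req.full_url)
```

The resulting `req.full_url` could also be passed to `browser.open(...)` in the loop from the question, if only the URL part of the trick is wanted.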

Hope that provides some insight.

Upvotes: 3
