yuk

Reputation: 19870

Saving html page from MATLAB web browser

Following this question, I get a message on the retrieved page saying "Your browser does not support JavaScript so some functionality may be missing!"

If I open this page with web(url) in the MATLAB web browser and accept the certificate (once per session), the page opens properly.

How can I save the page source from the browser with a script? Or from the system browser? Or maybe there is a way to get the page even without a browser?

url='https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';
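One browser-free option (a sketch outside MATLAB, assuming Python is available; the commented-out line performs the actual network request) is to skip certificate verification programmatically, which is the scripted equivalent of accepting the certificate in the browser:

```python
import ssl
import urllib.request

url = 'https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525'

# Skip certificate verification, mirroring wget's --no-check-certificate;
# this replaces the interactive "accept certificate" step in the browser.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Uncomment to perform the actual download:
# html = urllib.request.urlopen(url, context=ctx).read().decode('utf-8')
```

Disabling verification is only appropriate when you already trust the host, as with the browser's accept-certificate prompt.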

Upvotes: 1

Views: 2915

Answers (2)

ckg

Reputation: 2380

Would saving cookies be sufficient to solve your problem? wget can do that with --keep-session-cookies and --save-cookies filename; on subsequent requests, use --load-cookies filename to send the cookies back. Something like the following (note: I have not tested this from MATLAB, so the quoting etc. might not be exactly right, but I use a similar shell construction in other contexts):

command_init = ['wget --no-check-certificate ' ...
                '--page-requisites ' ...
                '--keep-session-cookies ' ...
                '--save-cookies cookie_file.txt ' ...
                '--post-data ''user=X&pass=Y&whatever=TRUE'' ' ...
                init_url];
command_get  = ['wget --no-check-certificate ' ...
                '--page-requisites ' ...
                '--load-cookies cookie_file.txt ' ...
                url];

If you don't have any POST data, but subsequent GETs update the cookies instead, you can simply use --keep-session-cookies and --save-cookies on successive GET requests.
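Outside of MATLAB and wget, the same session-cookie idea can be sketched with Python's standard library; the login URL and form fields below are hypothetical placeholders, and the network calls are commented out:

```python
import http.cookiejar
import urllib.request

# A CookieJar plays the role of wget's cookie_file.txt: cookies set by the
# login response are remembered and sent automatically on later requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical login step (substitute the site's real URL and form fields):
# opener.open('https://example.org/login', data=b'user=X&pass=Y')

# Subsequent requests through the same opener reuse the stored cookies:
# html = opener.open(url).read()
```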

Upvotes: 1

Amro

Reputation: 124563

From what I can tell, the page source gets downloaded just fine; just make sure to let JavaScript run when you open the saved page locally.

[...]
<script type='text/javascript' src='../js/hgTracks.js'></script>
<noscript><b>Your browser does not support JavaScript so some functionality may be missing!</b></noscript>
[...]

Note that the solution you are using only downloads the web page itself, without any of its linked resources (images, .css, .js, etc.).

What you can do is call wget to get the page with all of its files:

url = 'https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';
command = ['wget --no-check-certificate --page-requisites "' url '"'];
system( command );

If you are on a Windows machine, you can always get wget from the GnuWin32 project or from one of the many other implementations.

Upvotes: 2
