StandardIO

Reputation: 336

When I use requests.get() I get the wrong content

I don't know much about the web. I only want to download all the zip files from a web page with a Python script. But when I call requests.get(), I only get a pre-page with code that loads the real page (at least, that is what I think). Is there any way to load the correct content? My pipeline is:

  1. Load the page with requests.get().
  2. Pass the result to BeautifulSoup4 to obtain all the URLs to download.

The web page is https://divvy-tripdata.s3.amazonaws.com/index.html

I could copy the HTML directly from the DOM, but I really want to know what I'm doing wrong with the requests call :(

import requests
from bs4 import BeautifulSoup

page = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
soup = BeautifulSoup(page.content, 'html.parser')  # name the parser explicitly
print(soup)

What I got:

<html>
<head>
<!--

  Amazon S3 Bucket listing.


  Copyright (C) 2008 Francesco Pasqualini

      This program is free software: you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation, either version 3 of the License, or
      (at your option) any later version.

      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
      GNU General Public License for more details.

      You should have received a copy of the GNU General Public License
      along with this program.  If not, see <http://www.gnu.org/licenses/>.

  -->
<!--

  Modified by Nolan Lawson!  (http://nolanlawson.com).  I'm keeping the spirit of the
  GPL alive by issuing this with the same license!

  -->
<title>Bucket loading...</title>
<link href="//netdna.bootstrapcdn.com/bootstrap/2.3.2/css/bootstrap.min.css" rel="stylesheet"/>
<style>
        .hide-while-loading {
          display:none;
        }
        .i-expand-collapse {
          opacity: 0.3;
        }
        .i-file-or-folder {
          margin-right: 4px;
        }
      </style>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/handlebars.js/1.1.2/handlebars.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/moment.js/2.4.0/moment.min.js"></script>
</head>
<body>
<div class="container">
<h1 id="h1-title">Bucket loading...</h1>
<table class="hide-while-loading table table-striped">
<thead>
<tr>
<th>Name</th>
<th>Date Modified</th>
<th>Size</th>
<th>Type</th>
</tr>
</thead>
<tbody id="tbody-content">
</tbody>
</table>
</div>
<script id="file-or-folder" type="text/x-handlebars-template">
    <tr>
      {{#if isFolder}}
        <td><i class="icon-chevron-down i-expand-collapse" style="margin-left:calc(({{numLevels}} - 1) * 16px)");></i><i class="icon-folder-open i-file-or-folder" style="margin-left:4px;"></i>
        {{simpleFilename}}</td>
      {{else}}
        <td><i class="icon-file i-file-or-folder"  style="margin-left:calc(({{numLevels}} * 16px) + 4px);"></i>
        <a href="{{url}}">{{simpleFilename}}</a></td>
      {{/if}}
      <td>{{friendlyLastModified}}</td>

      <td>{{friendlySizeName}}</td>
      <td>{{type}}</td>
    </tr>
  </script>
<script>
    (function($){
      "use strict";
      var FOLDER_PATTERN = new RegExp('_\\$folder\\$$');
      var TYPE_PATTERN = new RegExp('\\.([^\\.\\s]{1,10})$');
        var KB = 1024;
        var MB = 1000000;
        var GB = 1000000000;

    // replace last /index.html to get bucket root
      var bucketUrl = document.location.href.replace(/\/[^\/]+$/, '');
        var compiledTemplate;

    // return e.g. 1.2KB, 1.3MB, 2GB, etc.
      function toFriendlySizeName(size){
        if (size === 0) {
          return '';
        } else if (size < KB) {
          return size + ' B';
        } else if (size < MB) {
          return (size / KB).toFixed(0) + ' KB';
        } else if (size < GB) {
          return (size / MB).toFixed(2) + ' MB';
        }
        return (size / GB).toFixed(2) + ' GB';
      }


      // POJO describing a file or a folder
      function FileOrFolder(lastModified, etag, size, key){
        var self = this;

        self.lastModified = lastModified;
        self.etag = etag;
        self.size = size;
        self.key = key;

        // init logic
        self.isFolder = FOLDER_PATTERN.test(self.key);
        self.filename = self.isFolder ? self.key.replace(FOLDER_PATTERN,'') : self.key;
        self.url = bucketUrl + '/' + self.key;
        self.levels = self.filename.split('/');
        self.numLevels = self.levels.length;
        self.simpleFilename = self.levels[self.numLevels - 1];
        self.friendlySizeName = toFriendlySizeName(parseInt(self.size,10));
        var foundTypes = TYPE_PATTERN.exec(self.simpleFilename);
        self.type = self.isFolder ? 'Folder ' : (foundTypes ? (foundTypes[1].toUpperCase() + ' file') : 'Unknown');
        self.friendlyLastModified = moment(lastModified).format('MMM Do YYYY, hh:mm:ss a');
      }

        function onAjaxSuccess(xml) {
            var listBucketResult = $(xml).find('ListBucketResult');

            // set a reasonable title instead of "Bucket loading"
            var title = 'Index of bucket "' + listBucketResult.find('Name').text() + '"';
            document.title = title;
            $('#h1-title').text(title);

            var $tbodyContent = $('#tbody-content');

            // create the file or folder objects

            var filesOrFolders = [];

            listBucketResult.find('Contents').each(function(idx, element){

                var $element = $(element);

                var fileOrFolder = new FileOrFolder(
                     $element.find('LastModified').text(),
                     $element.find('ETag').text(),
                     $element.find('Size').text(),
                     $element.find('Key').text()
                );

                filesOrFolders.push(fileOrFolder);
            });

            // sort
            filesOrFolders.sort(function(left, right){
                if (left.levels === right.levels) {
                    return 0;
                } else if (left.levels < right.levels) {
                    return -1;
                }
                return 1;
            });

            // fill in the rows
            var str = '';
            for (var i = 0; i < filesOrFolders.length; i ++) {
                str += compiledTemplate(filesOrFolders[i]);
            }
            $tbodyContent.append(str);
            $('.hide-while-loading').show();
        }

        $.ajax({
         url: bucketUrl,
         success: onAjaxSuccess
        });

    // compile while ajax is in progress
        compiledTemplate = Handlebars.compile($('#file-or-folder').html());

    })(jQuery);
    </script>
</body>
</html>

Upvotes: 2

Views: 150

Answers (1)

StandardIO

Reputation: 336

After reading the discussion provided by @colidyre, this is what worked for me.

I used the requests_html library to make the GET request. This library downloads the Chromium browser to your PC the first time you call the render method. That method executes the JavaScript in Chromium to render the page completely.
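
To see why the plain request finds no links, here is a minimal sketch that uses only the standard library's html.parser in place of BeautifulSoup. RAW_HTML is a trimmed-down stand-in for the response shown in the question, and RENDERED_HTML with its example.zip URL is a made-up illustration of what the table might look like once the JavaScript has run. AnchorCollector and collect_hrefs are names invented for this example.

```python
from html.parser import HTMLParser

# Trimmed-down stand-in for the unrendered response: the table body is empty
# and the only <a> lives inside a Handlebars template in a <script> tag.
RAW_HTML = """
<html><body>
<table><tbody id="tbody-content">
</tbody></table>
<script id="file-or-folder" type="text/x-handlebars-template">
  <tr><td><a href="{{url}}">{{simpleFilename}}</a></td></tr>
</script>
</body></html>
"""

# Hypothetical example of the same table after the JavaScript has filled it.
RENDERED_HTML = """
<html><body>
<table><tbody id="tbody-content">
<tr><td><a href="https://example.com/example.zip">example.zip</a></td></tr>
</tbody></table>
</body></html>
"""


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_hrefs(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Script contents are treated as plain text, so the template's <a> is not a tag.
print(collect_hrefs(RAW_HTML))       # []
print(collect_hrefs(RENDERED_HTML))  # ['https://example.com/example.zip']
```

The unrendered page simply contains no anchors for a parser to find, so something has to run the script (or fetch the data the script fetches) before any links exist.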

This library has two session classes for this purpose:

  • synchronous (HTMLSession)
  • asynchronous (AsyncHTMLSession).

I had to use the async one. The synchronous version can be found in the docs.

This is simple. Other methods involve installing a server, but that was overkill for me because this is not something I do frequently.

# Libraries

import os

import requests
from requests_html import AsyncHTMLSession

# Session and request
# (top-level await works in Jupyter/IPython; in a plain script, wrap this in an async function)

asession = AsyncHTMLSession()
r = await asession.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
await r.html.arender(sleep=1)  # sleep waits 1 s after the initial render, presumably so the page's AJAX call can fill the table
r.close()

# Extract the links and save the .zip URLs to a file

links = r.html.links

dir_path = "data/"
path_file = dir_path + "url_files.txt"
os.makedirs(dir_path, exist_ok=True)  # make sure the output directory exists

with open(path_file, mode='w') as url_files:
    for link in links:
        if link.endswith('.zip'):
            url_files.write(link + '\n')

# Download data

with open(path_file, mode='r') as url_file:
    for link in url_file:
        link = link.strip()  # drop the trailing newline
        response = requests.get(link)
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        file_name = link.split('/')[-1]
        with open(dir_path + file_name, mode='wb') as zip_file:
            zip_file.write(response.content)
        print(f'successfully downloaded file: {file_name}')
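
A possible alternative that skips the browser altogether: the script in the page from the question strips /index.html from the address, requests the bucket root, and reads the Key of every Contents element in the ListBucketResult XML it gets back. The same listing can be parsed directly. This is a sketch under assumptions, not a tested replacement: it assumes the bucket still allows public listing and that all keys fit in one response (S3 returns at most 1000 keys per request). zip_urls_from_listing and SAMPLE_LISTING are names and data invented for this example.

```python
import xml.etree.ElementTree as ET

BUCKET_URL = "https://divvy-tripdata.s3.amazonaws.com"

# Invented sample in the shape of an S3 ListBucketResult document.
SAMPLE_LISTING = """
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>example-a.zip</Key><Size>10</Size></Contents>
  <Contents><Key>index.html</Key><Size>5</Size></Contents>
</ListBucketResult>
""".strip()


def zip_urls_from_listing(xml_data, bucket_url=BUCKET_URL):
    """Return the download URL of every .zip key in a bucket listing.

    xml_data may be str or bytes (for example response.content).
    """
    root = ET.fromstring(xml_data)
    urls = []
    for element in root.iter():
        # Tags look like '{namespace}Key'; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Key" and element.text and element.text.endswith(".zip"):
            urls.append(f"{bucket_url}/{element.text}")
    return urls


print(zip_urls_from_listing(SAMPLE_LISTING))
# ['https://divvy-tripdata.s3.amazonaws.com/example-a.zip']
```

Against the real bucket, the input would be the body of requests.get(BUCKET_URL), and the returned URLs could be fed to the same download loop as above.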

Upvotes: 1
