Reputation: 336
I don't know much about web development. I just want to download all the zip files from a web page with a Python script. But when I call requests.get(),
I only get a preliminary page containing the code that loads the real page (at least, that is what I think). Is there any way to load the actual content?
My pipeline so far is just a single requests.get() call.
The web page is https://divvy-tripdata.s3.amazonaws.com/index.html
I could copy the HTML directly from the DOM in the browser, but I really want to know what I'm doing wrong with the requests call :(
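import requests
from bs4 import BeautifulSoup

page = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # prettify() returns a string; it does not modify soup in place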
What I got:
<html>
<head>
<!--
Amazon S3 Bucket listing.
Copyright (C) 2008 Francesco Pasqualini
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
-->
<!--
Modified by Nolan Lawson! (http://nolanlawson.com). I'm keeping the spirit of the
GPL alive by issuing this with the same license!
-->
<title>Bucket loading...</title>
<link href="//netdna.bootstrapcdn.com/bootstrap/2.3.2/css/bootstrap.min.css" rel="stylesheet"/>
<style>
.hide-while-loading {
display:none;
}
.i-expand-collapse {
opacity: 0.3;
}
.i-file-or-folder {
margin-right: 4px;
}
</style>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/handlebars.js/1.1.2/handlebars.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/moment.js/2.4.0/moment.min.js"></script>
</head>
<body>
<div class="container">
<h1 id="h1-title">Bucket loading...</h1>
<table class="hide-while-loading table table-striped">
<thead>
<tr>
<th>Name</th>
<th>Date Modified</th>
<th>Size</th>
<th>Type</th>
</tr>
</thead>
<tbody id="tbody-content">
</tbody>
</table>
</div>
<script id="file-or-folder" type="text/x-handlebars-template">
<tr>
{{#if isFolder}}
<td><i class="icon-chevron-down i-expand-collapse" style="margin-left:calc(({{numLevels}} - 1) * 16px)");></i><i class="icon-folder-open i-file-or-folder" style="margin-left:4px;"></i>
{{simpleFilename}}</td>
{{else}}
<td><i class="icon-file i-file-or-folder" style="margin-left:calc(({{numLevels}} * 16px) + 4px);"></i>
<a href="{{url}}">{{simpleFilename}}</a></td>
{{/if}}
<td>{{friendlyLastModified}}</td>
<td>{{friendlySizeName}}</td>
<td>{{type}}</td>
</tr>
</script>
<script>
(function($){
"use strict";
var FOLDER_PATTERN = new RegExp('_\\$folder\\$$');
var TYPE_PATTERN = new RegExp('\\.([^\\.\\s]{1,10})$');
var KB = 1024;
var MB = 1000000;
var GB = 1000000000;
// replace last /index.html to get bucket root
var bucketUrl = document.location.href.replace(/\/[^\/]+$/, '');
var compiledTemplate;
// return e.g. 1.2KB, 1.3MB, 2GB, etc.
function toFriendlySizeName(size){
if (size === 0) {
return '';
} else if (size < KB) {
return size + ' B';
} else if (size < MB) {
return (size / KB).toFixed(0) + ' KB';
} else if (size < GB) {
return (size / MB).toFixed(2) + ' MB';
}
return (size / GB).toFixed(2) + ' GB';
}
// POJO describing a file or a folder
function FileOrFolder(lastModified, etag, size, key){
var self = this;
self.lastModified = lastModified;
self.etag = etag;
self.size = size;
self.key = key;
// init logic
self.isFolder = FOLDER_PATTERN.test(self.key);
self.filename = self.isFolder ? self.key.replace(FOLDER_PATTERN,'') : self.key;
self.url = bucketUrl + '/' + self.key;
self.levels = self.filename.split('/');
self.numLevels = self.levels.length;
self.simpleFilename = self.levels[self.numLevels - 1];
self.friendlySizeName = toFriendlySizeName(parseInt(self.size,10));
var foundTypes = TYPE_PATTERN.exec(self.simpleFilename);
self.type = self.isFolder ? 'Folder ' : (foundTypes ? (foundTypes[1].toUpperCase() + ' file') : 'Unknown');
self.friendlyLastModified = moment(lastModified).format('MMM Do YYYY, hh:mm:ss a');
}
function onAjaxSuccess(xml) {
var listBucketResult = $(xml).find('ListBucketResult');
// set a reasonable title instead of "Bucket loading"
var title = 'Index of bucket "' + listBucketResult.find('Name').text() + '"';
document.title = title;
$('#h1-title').text(title);
var $tbodyContent = $('#tbody-content');
// create the file or folder objects
var filesOrFolders = [];
listBucketResult.find('Contents').each(function(idx, element){
var $element = $(element);
var fileOrFolder = new FileOrFolder(
$element.find('LastModified').text(),
$element.find('ETag').text(),
$element.find('Size').text(),
$element.find('Key').text()
);
filesOrFolders.push(fileOrFolder);
});
// sort
filesOrFolders.sort(function(left, right){
if (left.levels === right.levels) {
return 0;
} else if (left.levels < right.levels) {
return -1;
}
return 1;
});
// fill in the rows
var str = '';
for (var i = 0; i < filesOrFolders.length; i ++) {
str += compiledTemplate(filesOrFolders[i]);
}
$tbodyContent.append(str);
$('.hide-while-loading').show();
}
$.ajax({
url: bucketUrl,
success: onAjaxSuccess
});
// compile while ajax is in progress
compiledTemplate = Handlebars.compile($('#file-or-folder').html());
})(jQuery);
</script>
</body>
</html>
Upvotes: 2
Views: 150
Reputation: 336
After reading the discussion linked by @colidyre, I used the requests_html library to make the GET request.
The library downloads the Chromium browser to your machine the first time you call the render method. That method executes the page's JavaScript in Chromium so that the page is rendered completely.
The library provides two session classes for this: a synchronous one (HTMLSession) and an asynchronous one (AsyncHTMLSession).
I had to use the async one. The synchronous version can be found in the docs.
This approach is simple. Other methods involve installing a server, which was overkill for me because this is not something I do often.
# Libraries
import os

import requests
from requests_html import AsyncHTMLSession

# Session and request
# Note: top-level await works in Jupyter/IPython; in a plain script, wrap this in an async function.
asession = AsyncHTMLSession()
r = await asession.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
await r.html.arender(sleep=1)  # sleep is needed; presumably it gives the page's AJAX request time to finish
r.close()

# Process the links and save them to a file
links = r.html.links
dir_path = "data/"
os.makedirs(dir_path, exist_ok=True)  # make sure the output directory exists
path_file = dir_path + "url_files.txt"
with open(path_file, mode='w') as url_files:
    for link in links:
        if link.split('.')[-1] == 'zip':
            url_files.write(link + '\n')

# Download data
with open(path_file, mode='r') as url_file:
    for link in url_file:
        link = link.rstrip('\n')  # drop the trailing newline
        response = requests.get(link)
        file_name = link.split('/')[-1]
        with open(dir_path + file_name, mode='wb') as zipfile:
            zipfile.write(response.content)
        print(f'successfully downloaded file: {file_name}')
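As a possible alternative that avoids the browser entirely: the script embedded in index.html only requests the bucket root (the page URL without /index.html) and reads the Key elements from the ListBucketResult XML it gets back. The sketch below does the same thing in Python. It is a minimal sketch, not a tested replacement for the code above: the function names are my own, it uses only the standard library for the fetch (requests.get(bucket_url).content would work just as well), and it assumes the bucket still allows public listing. A single S3 listing response is also limited to 1000 keys, so a larger bucket would need pagination, which is not handled here.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Bucket root, i.e. the page URL with the trailing /index.html removed,
# which is what the page's own script computes as bucketUrl.
BUCKET_URL = "https://divvy-tripdata.s3.amazonaws.com"


def zip_urls_from_listing(xml_data, bucket_url=BUCKET_URL):
    """Return the URL of every .zip key found in a ListBucketResult document.

    xml_data may be str or bytes. (Hypothetical helper, not part of any library.)
    """
    root = ET.fromstring(xml_data)
    urls = []
    for element in root.iter():
        # Compare only the local tag name so the S3 XML namespace does not matter.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Key" and element.text:
            if element.text.lower().endswith(".zip"):
                urls.append(f"{bucket_url}/{element.text}")
    return urls


def fetch_zip_urls(bucket_url=BUCKET_URL):
    """Download the bucket listing and extract the .zip URLs (needs network access)."""
    with urllib.request.urlopen(bucket_url, timeout=30) as response:
        return zip_urls_from_listing(response.read(), bucket_url)
```

Calling fetch_zip_urls() should then give a list of links that can be fed to the same download loop as above, without Chromium being installed.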
Upvotes: 1