Reputation: 1305
I am trying to access a proprietary website which provides access to a very large database (many billions of entries). Each entry in the database is a link to a webpage that is essentially a flat file containing the information I need. I have about 2000 entries from the database, and I need to download their corresponding webpages. I have two related issues that I am trying to resolve:

Problem 1: I have not been able to get wget (or any other similar program) to read cookie data. I exported my cookies from Google Chrome using the cookies.txt extension (https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en), but for some reason the HTML downloaded by wget still cannot be rendered as a webpage. Similarly, I have not been able to get Google Chrome, run from the command line, to read these cookies. The cookies are needed to access the database, since they contain my credentials.
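For reference, this is the kind of invocation I have been attempting (the entry URL is just a placeholder, and cookies.txt is the file exported by the extension above):

    wget --load-cookies cookies.txt "https://example.com/database/entry/12345"

As far as I can tell, --load-cookies expects the Netscape cookies.txt format, which is what the extension exports, but the result is the same unrenderable HTML.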
Problem 2: I have not been able to download the webpages themselves using wget or similar tools. I tried using automate-save-page-as (https://github.com/abiyani/automate-save-page-as), but I continuously get an error saying the browser is not in my PATH.
Upvotes: 0
Views: 906
Reputation: 1305
I solved both of these issues:
Problem 1: I switched away from wget, curl, and Python's requests to simply using the selenium webdriver in Python. With selenium, I did not have to deal with issues such as passing cookies, headers, and POST/GET requests, since it actually opens a real browser. This also has the plus that, as I was writing the script, I could inspect the page and see what selenium was doing as it was doing it.
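A minimal sketch of that setup (the login URL and the form field names here are placeholders for the real site):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # opens a real Chrome window

    # Placeholder login page and field names; substitute the real ones.
    driver.get("https://example.com/login")
    driver.find_element(By.NAME, "username").send_keys("my_user")
    driver.find_element(By.NAME, "password").send_keys("my_password")
    driver.find_element(By.NAME, "submit").click()

    # The session cookies now live inside the browser itself, so every
    # subsequent driver.get(...) is already authenticated.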
Problem 2: Selenium has an attribute called page_source, which returns the HTML of the webpage. When I tested it, the HTML rendered correctly.
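A sketch of the download step, again with a placeholder entry URL (in practice the driver is already logged in as above, and this runs in a loop over the ~2000 entries):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com/database/entry/12345")  # placeholder entry URL

    # page_source holds the full HTML of the currently loaded page.
    html = driver.page_source
    with open("entry_12345.html", "w", encoding="utf-8") as f:
        f.write(html)

    driver.quit()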
Upvotes: 1