Reputation: 55
I want to get the raw data from a password-protected Pastebin link using Python, but I can't figure out what to do.
Is it possible to get Pastebin raw data using Python's requests module and a POST request? I tried the code below, but it returns an error.
import requests

url = "https://pastebin.com/URL"
pass_data = {'PostPasswordVerificationForm[password]': 'password'}
res = requests.post(url, headers=headers, data=pass_data)  # headers defined earlier
text = res.text
print(text)
It returns the following error:
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='pastebin.com', port=443):
Max retries exceeded with url: /URL (Caused by SSLError(SSLCertVerificationError
(1, '[SSL: CERTIFICATE_VERIFY_FAILED]certificate verify failed:
self signed certificate in certificate chain (_ssl.c:1123)')))
Can someone please tell me how I can do this?
Upvotes: 2
Views: 7699
Reputation: 12189
Note: Consider using Pastebin's API and Pastebin's scraping API.
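A minimal sketch of the API route, assuming the endpoint and field names from Pastebin's API documentation (api_raw.php with api_option=show_paste; verify against the current docs before use). The dev key comes from your account page, and the user key from a prior api_login.php call:

```python
import requests

# Hypothetical sketch: fetch a paste's raw body via the official API.
# Endpoint and parameter names follow Pastebin's documented API and
# should be checked against the current documentation.
API_RAW = "https://pastebin.com/api/api_raw.php"

def fetch_raw_via_api(dev_key, user_key, paste_key):
    """Request the raw paste text; works for private pastes as well."""
    res = requests.post(API_RAW, data={
        "api_dev_key": dev_key,      # developer key from your account
        "api_user_key": user_key,    # session key from api_login.php
        "api_paste_key": paste_key,  # the ID after pastebin.com/
        "api_option": "show_paste",
    })
    res.raise_for_status()
    return res.text
```

This avoids the CSRF/session juggling entirely, which is why the API is the recommended route.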
Your certificate verification failed (proxy/Tor/VPN/server without a valid cert/misconfigured server?). If you still want to proceed, simply pass verify=False as an argument to requests.post():
requests.post(url="...", verify=False)
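Applied to your original snippet, that looks like the sketch below (URL and password are placeholders). Note that disabling verification removes the authenticity check on the connection, so treat it as a debugging step only; requests will also emit an InsecureRequestWarning, which you can silence via urllib3:

```python
import requests
import urllib3

# verify=False bypasses TLS certificate checks -- debugging only.
# Suppress the InsecureRequestWarning requests would otherwise print.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def fetch_insecure(url, password):
    """POST the paste password with certificate verification disabled."""
    pass_data = {"PostPasswordVerificationForm[password]": password}
    return requests.post(url, data=pass_data, verify=False)

# res = fetch_insecure("https://pastebin.com/<your paste>", "<pass>")
# print(res.text)
```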
If you are using a VPN, perhaps you've been provided with a root certificate for your machine; point requests at it with verify="path to cert" (the cert=("path to cert", "path to key") argument is for client-side certificates, not for trusting a CA).
If you are using Tor, better skip that circuit and re-create a new one.
For a proxy, it's more complicated: it can be either a cert issue or the proxy being plainly misconfigured/broken.
You can verify there's no proxy in use by checking your network settings (OS-specific) and the environment variables the requests package honours:
http_proxy
HTTP_PROXY
https_proxy
HTTPS_PROXY
curl_ca_bundle
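A quick way to check those variables from Python itself (the variable list mirrors the one above; curl_ca_bundle is included since requests reads it for the CA bundle path):

```python
import os

# Proxy/CA-related environment variables that requests consults.
PROXY_VARS = ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY",
              "curl_ca_bundle", "CURL_CA_BUNDLE")

def active_proxy_vars():
    """Return the subset of proxy variables currently set."""
    return {name: os.environ[name] for name in PROXY_VARS if name in os.environ}

print(active_proxy_vars())  # an empty dict means no env-level proxy is configured
```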
Edit: I've just re-checked Pastebin; the RAW text option is only available for unprotected pastes. However, you can get to the HTML version by inspecting the traffic and then re-assembling it in code: keep the session, and check the cookies and headers in the network tab. You should end up with something like this:
import requests as r
ses = r.Session()
cookie = ses.get("https://pastebin.com").cookies["_csrf-frontend"]
# The missing step here is reworking the provided CSRF by client-side
# JS which is "hidden" in the minified jquery.min.js (or at least the
# `POST` is issued by it). Once you have it, you can put it to the
# data field
print(ses.post(
    url='https://pastebin.com/<your paste>',
    headers={
        'User-Agent': "<user agent to spoof it's via Requests>",
        'Accept': (
            'text/html'
            ',application/xhtml+xml'
            ',application/xml'
            ';q=0.9,image/webp,*/*;q=0.8'
        ),
        'Accept-Language': 'en-US,en;q=0.5',
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    data=(
        '_csrf-frontend=<JS-manipulated CSRF value>'
        '&is_burn=1'
        '&PostPasswordVerificationForm%5Bpassword%5D=<pass>'
    )
).text)
Afterwards, just check for the tag with RAW in it, and then parse it either with some quick regex (obligatory "parsing HTML with regex is a stupid idea" caveat) or with a less error-prone solution such as BeautifulSoup.
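For illustration, the quick-regex variant might look like this. The sample markup below is hypothetical; check the real page structure in your browser's inspector before relying on the pattern:

```python
import re

# Quick-and-dirty extraction of a raw-paste link from returned HTML.
# The markup here is a made-up sample, not Pastebin's actual page.
html = '<div><a href="/raw/abc123">RAW Paste Data</a></div>'

match = re.search(r'href="(/raw/[^"]+)"', html)
raw_path = match.group(1) if match else None
print(raw_path)  # -> /raw/abc123
```

For anything beyond a one-off script, BeautifulSoup (or the API, as above) is the sturdier choice.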
Nevertheless, captchas, IP blacklisting, "clever" CSRF handling, and similar measures will eventually block this kind of scraping. Even if they don't, it's trivially easy to build an application that dynamically changes its class names, tag names, etc. in Angular just to mess with your scraping for the lulz (Google Docs loves this stuff; personal experience). So if you intend to do something serious with it, just use the API.
Edit2: Minor FAQ for scraping / why to use the API
Upvotes: 3