Reputation: 231
I'm trying to parse data out of this webpage.
I don't need the whole dataset, though; I just need:
KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=
I tried to write some code, but I'm just a beginner at web scraping, so I was wondering if anyone could help. Here is my attempt, using the lxml and requests libraries.
import requests
from lxml import html
page = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json')
tree = html.fromstring(page.content)
#This will create a list of operators:
operators = tree.xpath('//span[@class="operators"]/text()')
print('Operators: ',operators)
I'm hoping for an end result that looks like the JSON on the website, minus all the unneeded info, so something like this for the operators:
[
  { "name": "Google",
    "logs": [
      { "description": "Google Argon2022 log",
        "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=" },
      { "description": "Google Argon2023 log",
        "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=" }
    ]
  },
  ....
  { "name": "CloudFlare",
    "logs": [ ... ]
  }
]
Upvotes: 0
Views: 236
Reputation: 8511
First, you want to access the raw file, not the rendered UI page. As Kache mentioned, you can get the JSON using:
resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
obj = json.loads(base64.decodebytes(resp.text.encode()))
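The base64 step is there because the ?format=TEXT link returns the file contents base64-encoded rather than as plain JSON. Here is a minimal offline sketch of just that step. The `sample` dict and the `encoded` string are made up and only stand in for the real `resp.text`:

```python
import base64
import json

# Hypothetical stand-in for the real file contents (not actual log data).
sample = {"operators": [{"name": "Example Operator", "logs": []}]}

# Simulate what the server sends: the JSON text, base64-encoded, as a str.
encoded = base64.encodebytes(json.dumps(sample).encode()).decode()

# Same decoding as above: str -> bytes -> base64-decode -> parse JSON.
decoded = json.loads(base64.decodebytes(encoded.encode()))

print(decoded == sample)
```

If the round trip works, `decoded` is an ordinary Python dict that you can index like any other.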
Then, you can use the following script to extract only the data you want:
import base64
import json

import requests

def extract_log(log):
    # Keep only the fields of interest from a single log entry.
    keys = ['description', 'log_id']
    return {key: log[key] for key in keys}

def extract_logs(logs):
    return [extract_log(log) for log in logs]

def extract_operator(operator):
    # Keep the operator's name and the trimmed-down list of its logs.
    return {
        'name': operator['name'],
        'logs': extract_logs(operator['logs'])
    }

def extract_certificates(obj):
    return [extract_operator(operator) for operator in obj['operators']]

def scrape_certificates(url):
    resp = requests.get(url)
    resp.raise_for_status()  # fail early on an HTTP error
    # ?format=TEXT returns the file base64-encoded, so decode before parsing.
    obj = json.loads(base64.decodebytes(resp.text.encode()))
    return extract_certificates(obj)

def main():
    out = scrape_certificates('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
    print(json.dumps(out, indent=4))

if __name__ == '__main__':
    main()
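If you prefer something more compact, the helper functions above can probably be collapsed into one nested comprehension. This is only a sketch, run on a made-up `sample` dict so it works without a network connection. The `keys` parameter and every value in `sample` are my own placeholders, not data from the real file:

```python
import json

def extract_certificates(obj, keys=("description", "log_id")):
    """Keep each operator's name and only the chosen keys of each log."""
    return [
        {
            "name": operator["name"],
            "logs": [{key: log[key] for key in keys} for log in operator["logs"]],
        }
        for operator in obj["operators"]
    ]

# Made-up sample; all values here are placeholders, not real log data.
sample = {
    "operators": [
        {
            "name": "Example Operator",
            "email": ["ops@example.invalid"],
            "logs": [
                {
                    "description": "Example log",
                    "log_id": "ZXhhbXBsZQ==",
                    "url": "https://log.example.invalid/",
                    "mmd": 86400,
                }
            ],
        }
    ]
}

result = extract_certificates(sample)
print(json.dumps(result, indent=4))
```

The extra fields (`email`, `url`, `mmd`) are dropped, and only `name`, `description`, and `log_id` survive. A log that lacks one of the requested keys would raise a `KeyError`, same as in the longer version.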
Upvotes: 2
Reputation: 16747
There is a link at the bottom right that lets you download the file directly: https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=JSON
That lets you avoid HTML parsing altogether.
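Here's Python code to extract it as a dict (it assumes requests, json, and base64 are imported, and it uses the ?format=TEXT variant of the link, which returns the file base64-encoded):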
resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
js = json.loads(base64.decodebytes(resp.text.encode()))
What remains of your question involves JSON and dict traversal and basic coding, which you should be able to find answers to in other questions.
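As a rough idea of what that traversal could look like, here is a sketch that finds a single log by its `log_id`. The `find_log` helper and the `sample` data are hypothetical, and the sketch assumes only the `operators` / `logs` / `log_id` layout shown in the question:

```python
def find_log(obj, log_id):
    """Return (operator name, log dict) for the matching log_id, or None."""
    for operator in obj["operators"]:
        for log in operator["logs"]:
            if log["log_id"] == log_id:
                return operator["name"], log
    return None

# Made-up sample in the same shape as the question's JSON (placeholder values).
sample = {
    "operators": [
        {"name": "Operator One",
         "logs": [{"description": "Log A", "log_id": "bG9nLWE="}]},
        {"name": "Operator Two",
         "logs": [{"description": "Log B", "log_id": "bG9nLWI="}]},
    ]
}

print(find_log(sample, "bG9nLWI="))
print(find_log(sample, "not-present"))
```

On the real data you would pass the decoded `js` dict instead of `sample`.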
Upvotes: 1