Reputation: 43
I want to scrape links to patents from a Google Patents search using BeautifulSoup, but I'm not sure whether Google renders the page with JavaScript, which BeautifulSoup cannot parse, or whether the issue is something else.
Here is some simple code:
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/?assignee=Roche&after=priority:20110602&type=PATENT&num=100'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
links = []
for link in soup.find_all('a', href=True):
    print(link['href'])
I also wanted to append the links to the list, but nothing is printed because the soup contains no 'a' tags. Is there any way to grab the links to all of the patents?
Upvotes: 1
Views: 1629
Reputation: 3400
The data is rendered dynamically, so it is hard to get with bs4 alone.
Instead, open Chrome's developer tools and go to the Network tab, select the XHR filter, and reload the page. Requests will appear under the Name column; one of them returns all of the data in JSON format.
Copy the address of that request and call it with the requests module. You can then extract whatever data you want from the response.
Each individual patent link is built from a publication_number, so you can join it with the original URL to get the links to the publications.
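Comparing the address copied from the Network tab with the original search URL, the XHR address appears to be the search query string, percent-encoded and passed as the url parameter. A minimal sketch of that step, assuming the endpoint keeps this shape (build_xhr_url and XHR_ENDPOINT are hypothetical names for illustration, and the endpoint is not a documented API):

```python
from urllib.parse import quote, urlsplit

# Endpoint observed in the browser's Network tab; undocumented, so it may change.
XHR_ENDPOINT = "https://patents.google.com/xhr/query"


def build_xhr_url(search_url):
    """Turn a Google Patents search URL into the matching XHR query URL."""
    # Take everything after the '?' in the search URL.
    query = urlsplit(search_url).query
    # Percent-encode the whole query string, including '=', '&' and ':'.
    encoded = quote(query, safe="")
    return f"{XHR_ENDPOINT}?url={encoded}&exp="


search_url = "https://patents.google.com/?assignee=Roche&after=priority:20110602&type=PATENT&num=100"
print(build_xhr_url(search_url))
```

This way other searches can be fetched without copying each address from the developer tools by hand.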
import requests

main_url = "https://patents.google.com/"
params = "?assignee=Roche&after=priority:20110602&type=PATENT&num=100"
# XHR address copied from the Network tab; it returns the search results as JSON
res = requests.get("https://patents.google.com/xhr/query?url=assignee%3DRoche%26after%3Dpriority%3A20110602%26type%3DPATENT%26num%3D100&exp=")
main_data = res.json()
data = main_data['results']['cluster']
# Each result carries the publication_number used to build the patent link
for result in data[0]['result']:
    num = result['patent']['publication_number']
    print(num)
    print(main_url + "patent/" + num + "/en" + params)
Output:
US10287352B2
https://patents.google.com/patent/US10287352B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10364292B2
https://patents.google.com/patent/US10364292B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10494633B2
.....
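Since the question also asked for the links in a list, here is a rough sketch that collects them instead of printing. It assumes the JSON has the same results / cluster / result / patent / publication_number nesting as the code above; patent_links is a hypothetical helper name and the sample payload is made up to mirror that structure. Unlike the loop above, it walks every cluster rather than only the first, and it leaves the search query string off the links.

```python
def patent_links(payload, base_url="https://patents.google.com/"):
    """Collect patent page links from a parsed XHR response."""
    links = []
    # Walk every cluster, tolerating missing keys in the response.
    for cluster in payload.get("results", {}).get("cluster", []):
        for item in cluster.get("result", []):
            number = item.get("patent", {}).get("publication_number")
            if number:
                links.append(f"{base_url}patent/{number}/en")
    return links


# Made-up payload that only mirrors the nesting used above.
sample = {
    "results": {
        "cluster": [
            {
                "result": [
                    {"patent": {"publication_number": "US10287352B2"}},
                    {"patent": {"publication_number": "US10364292B2"}},
                ]
            }
        ]
    }
}
print(patent_links(sample))
```

With a live response, the call would be patent_links(res.json()).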
Upvotes: 2