Reputation: 22440
I've written a script in Python to collect the links leading to different articles from a webpage. When I run it, it fetches them flawlessly. The problem I'm facing is that the article links span multiple pages, since there are too many to fit on a single page. When I click the next-page button, I can see the attached information in the developer tools, which in reality produces an AJAX call through a POST request. As there is no link attached to that next-page button, I can't find any way to move on to the next page and parse links from there. I've tried a POST request
with that form data
but it doesn't seem to work. Where am I going wrong?
Link to the landing page containing articles
This is the information I get using chrome dev tools when I click on the next page button:
GENERAL
=======================================================
Request URL: https://www.ncbi.nlm.nih.gov/pubmed/
Request Method: POST
Status Code: 200 OK
Remote Address: 130.14.29.110:443
Referrer Policy: origin-when-cross-origin
RESPONSE HEADERS
=======================================================
Cache-Control: private
Connection: Keep-Alive
Content-Encoding: gzip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html; charset=UTF-8
Date: Fri, 29 Jun 2018 10:27:42 GMT
Keep-Alive: timeout=1, max=9
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03.m_8
NCBI-SID: CE8C479DB3510951_0083SID
Referrer-Policy: origin-when-cross-origin
Server: Apache
Set-Cookie: ncbi_sid=CE8C479DB3510951_0083SID; domain=.nih.gov; path=/; expires=Sat, 29 Jun 2019 10:27:42 GMT
Set-Cookie: WebEnv=1Jqk9ZOlyZSMGjHikFxNDsJ_ObuK0OxHkidgMrx8vWy2g9zqu8wopb8_D9qXGsLJQ9mdylAaDMA_T-tvHJ40Sq_FODOo33__T-tAH%40CE8C479DB3510951_0083SID; domain=.nlm.nih.gov; path=/; expires=Fri, 29 Jun 2018 18:27:42 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
REQUEST HEADERS
========================================================
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 395
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: ncbi_sid=CE8C479DB3510951_0083SID; _ga=GA1.2.1222765292.1530204312; _gid=GA1.2.739858891.1530204312; _gat=1; WebEnv=18Kcapkr72VVldfGaODQIbB2bzuU50uUwU7wrUi-x-bNDgwH73vW0M9dVXA_JOyukBSscTE8Qmd1BmLAi2nDUz7DRBZpKj1wuA_QB%40CE8C479DB3510951_0083SID; starnext=MYGwlsDWB2CmAeAXAXAbgA4CdYDcDOsAhpsABZoCu0IA9oQCZxLJA===
Host: www.ncbi.nlm.nih.gov
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03
Origin: https://www.ncbi.nlm.nih.gov
Referer: https://www.ncbi.nlm.nih.gov/pubmed
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
X-Requested-With: XMLHttpRequest
FORM DATA
========================================================
p$l: AjaxServer
portlets: id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity
load: yes
This is my script so far (the GET request works flawlessly if uncommented, but only for the first page):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"
# res = requests.get(geturl,headers={"User-Agent":"Mozilla/5.0"})
# soup = BeautifulSoup(res.text,"lxml")
# for items in soup.select("div.rslt p.title a"):
#     print(items.get("href"))

FormData = {
    'p$l': 'AjaxServer',
    'portlets': 'id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity',
    'load': 'yes'
}

req = requests.post(posturl, data=FormData, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(req.text, "lxml")
for items in soup.select("div.rslt p.title a"):
    print(items.get("href"))
Btw, the url in the browser becomes "https://www.ncbi.nlm.nih.gov/pubmed" when I click on the next page link.
I don't wish to go for any solution related to any browser simulator. Thanks in advance.
Upvotes: 9
Views: 1618
Reputation: 1745
Not to treat this question as an XY problem (the question as asked, if solved, would make a very interesting solution), BUT I have found a much more efficient approach for this specific issue: using NCBI's Entrez Programming Utilities together with a handy, open-source, unofficial Entrez repo.
With the entrez.py
script from the Entrez repo on my PATH
, I've created this script that prints out the links just as you want them:
from entrez import on_search
import re

db = 'pubmed'
term = '"2015"[Date - Publication] : "3000"[Date - Publication]'
link_base = f'https://www.ncbi.nlm.nih.gov/{db}/'

def links_generator(db, term):
    for line in on_search(db=db, term=term, tool='link'):
        match = re.search(r'<Id>([0-9]+)</Id>', line)
        if match:
            yield link_base + match.group(1)

for link in links_generator(db, term):
    print(link)
Output:
https://www.ncbi.nlm.nih.gov/pubmed/29980165
https://www.ncbi.nlm.nih.gov/pubmed/29980164
https://www.ncbi.nlm.nih.gov/pubmed/29980163
https://www.ncbi.nlm.nih.gov/pubmed/29980162
https://www.ncbi.nlm.nih.gov/pubmed/29980161
https://www.ncbi.nlm.nih.gov/pubmed/29980160
https://www.ncbi.nlm.nih.gov/pubmed/29980159
https://www.ncbi.nlm.nih.gov/pubmed/29980158
https://www.ncbi.nlm.nih.gov/pubmed/29980157
https://www.ncbi.nlm.nih.gov/pubmed/29980156
https://www.ncbi.nlm.nih.gov/pubmed/29980155
https://www.ncbi.nlm.nih.gov/pubmed/29980154
https://www.ncbi.nlm.nih.gov/pubmed/29980153
https://www.ncbi.nlm.nih.gov/pubmed/29980152
https://www.ncbi.nlm.nih.gov/pubmed/29980151
https://www.ncbi.nlm.nih.gov/pubmed/29980150
https://www.ncbi.nlm.nih.gov/pubmed/29980149
https://www.ncbi.nlm.nih.gov/pubmed/29980148
...
These are in the same order as on the frontend page. :-)
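For completeness, the same IDs can also be fetched without the helper repo by calling the E-utilities esearch endpoint directly with requests and parsing the XML it returns (a minimal sketch; the endpoint and its db/term/retstart/retmax parameters are NCBI's documented E-utilities API, while the `ids_to_links` and `search_ids` helper names are just illustrative):

```python
import re
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
LINK_BASE = "https://www.ncbi.nlm.nih.gov/pubmed/"

def ids_to_links(ids, base=LINK_BASE):
    # Turn a list of PubMed IDs into article URLs.
    return [base + pmid for pmid in ids]

def search_ids(term, db="pubmed", retstart=0, retmax=100):
    # Fetch one page of search results from esearch (XML response)
    # and pull out the <Id> elements.
    params = {"db": db, "term": term, "retstart": retstart, "retmax": retmax}
    xml = requests.get(EUTILS, params=params).text
    return re.findall(r"<Id>([0-9]+)</Id>", xml)

# Usage (makes a network request; page through results with retstart):
# term = '"2015"[Date - Publication] : "3000"[Date - Publication]'
# for link in ids_to_links(search_ids(term)):
#     print(link)
```

Paging is then just a matter of incrementing `retstart` by `retmax` until no more IDs come back.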
Upvotes: 0
Reputation: 15376
The content is heavily dynamic, so it would be best to use selenium
or a similar client, but I realize that this wouldn't be practical since the number of results is so large. So, we'll have to analyse the HTTP requests submitted by the browser and simulate them with requests
.
The contents of the next page are loaded by a POST request to /pubmed
, and the POST data are the input fields of the EntrezForm
form. The form submission is controlled by js (triggered when the 'next page' button is clicked), and is performed with the .submit()
method.
After some examination I discovered some interesting fields:
EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage
and
EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage
indicate the current and next page.
EntrezSystem2.PEntrez.DbConnector.Cmd
seems to perform a database query. If we don't submit this field, the results won't change.
EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize
and
EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize
indicate the number of results per page.
With that information I was able to get multiple pages with the script below.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"
s = requests.session()
s.headers["User-Agent"] = "Mozilla/5.0"
soup = BeautifulSoup(s.get(geturl).text,"lxml")
inputs = {i['name']: i.get('value', '') for i in soup.select('form#EntrezForm input[name]')}
results = int(inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.ResultCount'])
items_per_page = 100
pages = results // items_per_page + int(bool(results % items_per_page))
inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize'] = items_per_page
inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize'] = items_per_page
inputs['EntrezSystem2.PEntrez.DbConnector.Cmd'] = 'PageChanged'
links = []
for page in range(pages):
    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage'] = page + 1
    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage'] = page
    res = s.post(posturl, inputs)
    soup = BeautifulSoup(res.text, "lxml")
    items = [i['href'] for i in soup.select("div.rslt p.title a[href]")]
    links += items
    for i in items:
        print(i)
I'm requesting 100 items per page because higher numbers seem to 'break' the server, but you should be able to adjust that number with some error checking.
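One way to add that error checking is a small retry wrapper around the POST call (a sketch; `with_retry` is a hypothetical helper, not part of requests, and the backoff scheme is just one reasonable choice):

```python
import time

def with_retry(func, attempts=3, delay=1.0):
    # Call func(); on exception, wait and retry up to `attempts` times,
    # backing off linearly between tries.
    last_exc = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # narrow this to requests.RequestException in real code
            last_exc = exc
            time.sleep(delay * (attempt + 1))
    raise last_exc

# Usage with the script above (illustrative):
# res = with_retry(lambda: s.post(posturl, inputs))
```

You could also check `res.status_code` inside the wrapped callable and raise on anything other than 200, so that a server error on a large page size triggers a retry (or a smaller page size) instead of silently producing an empty result list.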
Finally, the links are displayed in descending order (/29960282
, /29960281
, ...), so I thought we could calculate the links without performing any POST requests:
geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"
s = requests.session()
s.headers["User-Agent"] = "Mozilla/5.0"
soup = BeautifulSoup(s.get(geturl).text,"lxml")
results = int(soup.select_one('[name$=ResultCount]')['value'])
first_link = int(soup.select_one("div.rslt p.title a[href]")['href'].split('/')[-1])
last_link = first_link - results
links = [posturl + str(i) for i in range(first_link, last_link, -1)]
But unfortunately the results are not accurate, presumably because the IDs in the result set are not perfectly consecutive.
Upvotes: 2