Hudlommen

Reputation: 89

Web-scraping PubMed's 2nd, 3rd, 4th... pages

I am trying to web-scrape PubMed, but I need to get to "page 2" and beyond, and I am not sure what kind of code does that.

So, I have looked at this link: Web Scraping - Get to Page 2

I am quite certain that it holds the answer; I just do not know exactly how to implement it in my situation: which variables to use and what to send.

All the other posts about web-scraping and PubMed are about different things.

My code:

import requests
from bs4 import BeautifulSoup

params = {
    'name': "EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page",
    'title': "Next page of results",
    'class': "active page_link next",
    'href': "#",
    'sid': 3,
    'page': 3,
    'accesskey': "k",
    'id': "EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page"
}

page_link = 'https://www.ncbi.nlm.nih.gov/pubmed/?term=emergency+nurse+AND+pain'
page_response = requests.get(page_link, timeout=5, params=params)
page_content = BeautifulSoup(page_response.content, "html.parser")

print(page_content)

The code that the "Next" button calls (this is the code from page 2):

<a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">Next &gt;</a>

It is part of this larger block:

<div class="title_and_pager">
            <div><h2>Search results</h2><h3 class="result_count left">Items: 201 to 400 of 367719</h3><span id="result_sel" class="nowrap"></span><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.ResultCount" sid="1" type="hidden" id="resultcount" value="367719" /><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.RunLastQuery" sid="1" type="hidden" /></div>
            <div class="pagination"><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="First page of results" class="active page_link" href="#" sid="1" page="1" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">&lt;&lt; First</a><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Previous page of results" class="active page_link prev" href="#" sid="2" page="1" accesskey="j" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">&lt; Prev</a><h3 class="page"><label for="pageno">Page </label><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage" id="pageno" type="text" class="num" sid="1" value="2" last="1839" /> of 1839</h3><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">Next &gt;</a><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Last page of results" class="active page_link" href="#" sid="4" page="1839" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">Last &gt;&gt;</a><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage" sid="1" type="hidden" value="2" /></div>
        </div>    

I can obviously scrape everything from "page 1", but I need to scrape all the pages. I just need a hint on how to set it up, not the whole code working to perfection. I know you guys have better things to do.

Upvotes: 1

Views: 600

Answers (1)

user14002256

I've noticed that the website you are trying to read has a pattern in its URL. For every page, the end of the URL changes to page=NUMBER. So, the first page has the URL:

"https://www.ncbi.nlm.nih.gov/pubmed/?term=emergency+nurse+AND+pain"

which I found leads to the same results as:

"https://pubmed.ncbi.nlm.nih.gov/?term=emergency%20nurse%20AND%20pain&page=1"

Page 2 has the URL:

"https://pubmed.ncbi.nlm.nih.gov/?term=emergency%20nurse%20AND%20pain&page=2"

And so on. You could loop through the 85 pages of results and scan each of them with a simple for loop:

import requests

base_url = "https://pubmed.ncbi.nlm.nih.gov/?term=emergency%20nurse%20AND%20pain&page="

# range(1, 86) yields 1..85, so every one of the 85 pages is requested
for page in range(1, 86):
    response = requests.get(url=base_url + str(page), timeout=5)
    # read page...
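
As a slightly more reusable take on the same idea, here is a minimal sketch that builds the page URL from the search term instead of hard-coding it. The helper names (`build_page_url`, `fetch_html`, `iter_pages`) are made up for illustration, and it assumes the `term`/`page` query parameters keep behaving as in the URLs above. The number of pages is passed in by the caller, since it depends on the search.

```python
from urllib.parse import urlencode

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def build_page_url(term, page):
    """Return the search URL for one results page (pages are 1-based)."""
    # urlencode writes spaces as '+', which servers decode the same way as %20
    return BASE_URL + "?" + urlencode({"term": term, "page": page})


def fetch_html(url):
    """Download one page and return its HTML text."""
    # imported here so the URL helper above can be used without requests installed
    import requests

    response = requests.get(url, timeout=5)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    return response.text


def iter_pages(term, last_page, fetch=fetch_html):
    """Yield (page_number, html) for pages 1..last_page inclusive."""
    for page in range(1, last_page + 1):
        yield page, fetch(build_page_url(term, page))
```

You would then loop with something like `for page, html in iter_pages("emergency nurse AND pain", 85):` and hand each `html` to `BeautifulSoup(html, "html.parser")` as in your own code. The `fetch` argument is only there so the download step can be swapped out, for example to add a delay between requests.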
    

If you have any questions, let me know! I hope I could help you!

Upvotes: 1
