C.K.

Reputation: 85

Getting the closest previous element with this attribute using BeautifulSoup

I am using BeautifulSoup to scrape metadata from journal articles and need to retrieve each article's category. For example, let's use this article. I've pasted the block of HTML I'm trying to parse below.

<div id="landingDetailPluginDiv" class="p20">
  <div class="article_category">CLINICAL</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/measuring-overuse-with-electronic-health-records-data">Measuring Overuse With Electronic Health Records Data</a></div>
    <div class="article_plus">Thomas Isaac, MD, MBA, MPH; Meredith B. Rosenthal, PhD; Carrie H. Colla, PhD; Nancy E. Morden, MD, MPH; Alexander J. Mainor, JD, MPH; Zhonghe Li, MS; Kevin H. Nguyen, MS; Elizabeth A. Kinsella, BA; and Thomas D. Sequist, MD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">FROM THE EDITORS</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/the-health-information-technology-special-issue-has-it-become-a-mandatory-part-of-health-and-healthcare">The Health Information Technology Special Issue: Has IT Become a Mandatory Part of Health and Healthcare?</a></div>
    <div class="article_plus">Jacob Reider, MD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">MANAGERIAL</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/bridging-the-digital-divide-mobile-access-to-personal-health-records-among-patients-with-diabetes">Bridging the Digital Divide: Mobile Access to Personal Health Records Among Patients With Diabetes</a></div>
    <div class="article_plus">Ilana Graetz, PhD; Jie Huang, PhD; Richard J. Brand, PhD; John Hsu, MD, MBA, MSCE; Cyrus K. Yamin, MD; and Mary E. Reed, DrPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">POLICY</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/electronic-health-record-superusers-and-underusers-in-ambulatory-care-practices">Electronic Health Record "Super-Users" and "Under-Users" in Ambulatory Care Practices</a></div>
    <div class="article_plus">Juliet Rumball-Smith, MBChB, PhD; Paul Shekelle, MD, PhD; and Cheryl L. Damberg, PhD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/electronic-sharing-of-diagnostic-information-and-patient-outcomes">Electronic Sharing of Diagnostic Information and Patient Outcomes</a></div>
    <div class="article_plus">Darwyyn Deyo, PhD; Amir Khaliq, PhD; David Mitchell, PhD; and Danny R. Hughes, PhD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/hospital-participation-in-meaningful-use-and-racial-disparities-in-readmissions">Hospital Participation in Meaningful Use and Racial Disparities in Readmissions</a></div>
    <div class="article_plus">Mark Aaron Unruh, PhD; Hye-Young Jung, PhD; Rainu Kaushal, MD, MPH; and Joshua R. Vest, PhD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">WEB EXCLUSIVE</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/a-costeffectiveness-analysis-of-cardiology-econsults-for-medicaid-patients">A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients</a></div>
    <div class="article_plus">Daren Anderson, MD; Victor Villagra, MD; Emil N. Coman, PhD; Ianita Zlateva, MPH; Alex Hutchinson, MBA; Jose Villagra, BS; and J. Nwando Olayiwola, MD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/electronic-health-record-problem-lists-accurate-enough-for-risk-adjustment">Electronic Health Record Problem Lists: Accurate Enough for Risk Adjustment?</a></div>
    <div class="article_plus">Timothy J. Daskivich, MD, MSHPM; Garen Abedi, MD, MS; Sherrie H. Kaplan, PhD, MPH; Douglas Skarecky, BS; Thomas Ahlering, MD; Brennan Spiegel, MD, MSHS; Mark S. Litwin, MD, MPH; and Sheldon Greenfield, MD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/racialethnic-variation-in-devices-used-to-access-patient-portals">Racial/Ethnic Variation in Devices Used to Access Patient Portals</a></div>
    <div class="article_plus">Eva Chang, PhD, MPH; Katherine Blondon, MD, PhD; Courtney R. Lyles, PhD; Luesa Jordan, BA; and James D. Ralston, MD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="current_article fl">
      <div class="article_title">Currently Reading</div>
      <div class="article_title b">Hospitalized Patients' and Family Members' Preferences for Real-Time, Transparent Access to Their Hospital Records</div>
      <div class="article_plus b">Michael J. Waxman, MD, MPH; Kurt Lozier, MBA; Lana Vasiljevic, MS; Kira Novakofski, PhD; James Desemone, MD; John O'Kane, RRT-NPS, MBA; Elizabeth M. Dufort, MD; David Wood, MBA; Ashar Ata, MBBS, PhD; Louis Filhour, PhD, RN; & Richard J. Blinkhorn
        Jr, MD</div>

As the snippet shows, there are multiple article_category elements, because each issue's table of contents is listed in a side panel on every article's web page. I only want the category for the current article, which means I need the last <div class="article_category"> (in this case, WEB EXCLUSIVE) that comes before <div class="article_title b"> (Hospitalized Patients' and Family Members' Preferences for Real-Time, Transparent Access to Their Hospital Records). I am not sure whether these elements should be treated as siblings.

Upvotes: 1

Views: 2252

Answers (2)

QHarr

Reputation: 84465

You can use :has and :contains to identify the target element by its title and then select the preceding div. The + is the adjacent sibling combinator, so .article_category:has(+ .article_text:contains("...")) matches the article_category div whose immediately following sibling is the .article_text div containing that title. The title used here, "A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients", is the first article listed under WEB EXCLUSIVE in the snippet above, so its .article_text directly follows the category div.


import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.ajmc.com/journals/issue/2018/2018-vol24-n1/hospitalized-patients-and-family-members-preferences-for-realtime-transparent-access-to-their-hospital-records')
soup = bs(r.content, 'lxml')
# Match the category div whose next sibling is the .article_text holding this title.
# Newer soupsieve releases deprecate :contains in favor of :-soup-contains.
category = soup.select_one('.article_category:has(+.article_text:contains("A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients"))').text
print(category)
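The adjacent sibling selector only finds a category when the article is the first one listed under it. As a minimal offline sketch of that limitation, the block below runs the same idea against a reduced, made-up stand-in for the side panel (the titles and the helper names category_by_adjacent_sibling and closest_previous_category are hypothetical, not from the page). It assumes a soupsieve version that supports :-soup-contains, and it swaps in the general sibling combinator ~ as one possible way to reach the closest earlier category.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical stand-in for the side panel markup in the question.
HTML = """
<div id="landingDetailPluginDiv">
  <div class="article_category">POLICY</div>
  <div class="article_text"><div class="article_title">Policy Paper</div></div>
  <div class="borderBottom"></div>
  <div class="article_category">WEB EXCLUSIVE</div>
  <div class="article_text"><div class="article_title">First Web Paper</div></div>
  <div class="borderBottom"></div>
  <div class="article_text"><div class="article_title b">Second Web Paper</div></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def category_by_adjacent_sibling(soup, title):
    """Category div whose *next* sibling is the entry containing `title`."""
    selector = '.article_category:has(+ .article_text:-soup-contains("{}"))'.format(title)
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag else None


def closest_previous_category(soup, title):
    """Last category div that has the entry containing `title` as a later sibling."""
    selector = '.article_category:has(~ .article_text:-soup-contains("{}"))'.format(title)
    matches = soup.select(selector)  # returned in document order
    return matches[-1].get_text(strip=True) if matches else None


print(category_by_adjacent_sibling(soup, "First Web Paper"))
print(category_by_adjacent_sibling(soup, "Second Web Paper"))
print(closest_previous_category(soup, "Second Web Paper"))
```

In this sketch the adjacent sibling version returns None for "Second Web Paper", because a borderBottom div sits between it and any category, while the ~ version takes the last matching category in document order. Titles that themselves contain double quotes would need escaping before being formatted into the selector.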

Upvotes: 0

Andrej Kesely

Reputation: 195553

To retrieve this article's category (WEB EXCLUSIVE) from the sidebar, you can try the code below. It first reads the article's title from the page, then finds the matching title div in the right sidebar, and finally takes the closest preceding article_category div:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ajmc.com/journals/issue/2018/2018-vol24-n1/hospitalized-patients-and-family-members-preferences-for-realtime-transparent-access-to-their-hospital-records'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Use the page's <title> text to locate this article's entry in the right sidebar
title = soup.title.text
d = soup.select_one('#rightTabContent div.article_title:contains("{}")'.format(title))
# find_previous walks backwards through the document to the nearest earlier match
print(d.find_previous('div', class_='article_category').text)

Prints:

WEB EXCLUSIVE
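Since the live page may change, here is a small offline sketch of the same find_previous idea on a reduced, made-up version of the sidebar (the titles are placeholders, and wrapping everything in #rightTabContent is an assumption based on the selector above). It also shows a sibling-based alternative, which speaks to the question of whether the elements can be treated as siblings: the title div is nested, but its enclosing article_text div is a sibling of the category divs.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical sidebar; structure mirrors the snippet in the question.
HTML = """
<div id="rightTabContent">
  <div class="article_category">POLICY</div>
  <div class="article_text">
    <div class="article_title"><a href="#">Policy Paper</a></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">WEB EXCLUSIVE</div>
  <div class="article_text">
    <div class="article_title"><a href="#">First Web Paper</a></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="current_article fl">
      <div class="article_title">Currently Reading</div>
      <div class="article_title b">Second Web Paper</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The current article is the title div that also carries the "b" class.
current_title = soup.select_one("div.current_article div.article_title.b")

# Option 1: search backwards through the whole document for the nearest category.
category_div = current_title.find_previous("div", class_="article_category")
category = category_div.get_text(strip=True) if category_div else None
print(category)

# Option 2: climb to the enclosing entry, then look only at its earlier siblings.
entry = current_title.find_parent("div", class_="article_text")
sibling_category = entry.find_previous_sibling("div", class_="article_category")
print(sibling_category.get_text(strip=True) if sibling_category else None)
```

Option 1 does not depend on how deeply the title is nested, so it is the more forgiving choice if the markup varies. Option 2 is stricter and would only consider category divs at the same level as the entry.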

Further reading:

CSS Selector Reference

Upvotes: 1
