jdoe

Reputation: 654

Efficiently getting IDs from PubMed

I am currently working on finding direct links between citations on PubMed/MEDLINE and clinical trial registrations. Specifically, given a single PMID, I want to find all of the IDs that the citation has in any clinical trial registry. (For example, PMID 29593018 has the ID ACTRN12616000470493.)

Currently, I am only searching for links to ClinicalTrials.gov (ID format: NCT followed by an 8-digit number, e.g. NCT01435343) using the following regex:

attributes = {'mdTitle': 'High-dose versus standard-dose amoxicillin/clavulanate for clinically-diagnosed acute bacterial sinusitis: A randomized clinical trial.', 'mdAbstract': 'BACKGROUND: The recommended treatment for acute bacterial sinusitis in adults, amoxicillin with clavulanate, provides only modest benefit. OBJECTIVE: To see if a higher dose of amoxicillin will lead to more rapid improvement. DESIGN, SETTING, AND PARTICIPANTS: Double-blind randomized trial in which, from November 2014 through February 2017, we enrolled 315 adult outpatients diagnosed with acute sinusitis in accordance with Infectious Disease Society of America guidelines. INTERVENTIONS: Standard-dose (SD) immediate-release (IR) amoxicillin/clavulanate 875 /125 mg (n = 159) vs. high-dose (HD) (n = 156). The original HD formulation, 2000 mg of extended-release (ER) amoxicillin with 125 mg of IR clavulanate twice a day, became unavailable half way through the study. The IRB then approved a revised protocol after patient 180 to provide 1750 mg of IR amoxicillin twice a day in the HD formulation and to compare Time Period 1 (ER) with Time Period 2 (IR). MAIN MEASURE: The primary outcome was the percentage in each group reporting a major improvement-defined as a global assessment of sinusitis symptoms as "a lot better" or "no symptoms"-after 3 days of treatment. KEY RESULTS: Major improvement after 3 days was reported during Period 1 by 38.8% of ER HD versus 37.9% of SD patients (P = 0.91) and during Period 2 by 52.4% of IR HD versus 34.4% of SD patients, an effect size of 18% (95% CI 0.75 to 35%, P = 0.04). No significant differences in efficacy were seen at Day 10. The major side effect, severe diarrhea at Day 3, was reported during Period 1 by 7.4% of HD and 5.7% of SD patients (P = 0.66) and during Period 2 by 15.8% of HD and 4.8% of SD patients (P = 0.048). CONCLUSIONS: Adults with clinically diagnosed acute bacterial sinusitis were more likely to improve rapidly when treated with IR HD than with SD but not when treated with ER HD. They were also more likely to suffer severe diarrhea. Further study is needed to confirm these findings. TRIAL REGISTRATION: ClinicalTrials.gov Identifier: NCT02340000.', 'mdMesh': '', 'mdPMID': '29738561', 'mdPublicationType': ['Journal Article'], 'mdAuthor': ['Matho A', 'Mulqueen M', 'Tanino M', 'Quidort A', 'Cheung J', 'Pollard J', 'Rodriguez J', 'Swamy S', 'Tayler B', 'Garrison G', 'Ata A', 'Sorum P'], 'mdDataPublished': '2018', 'mdPMC': '', 'mdSI': ['ClinicalTrials.gov/NCT02340000'], 'mdAID': ['10.1371/journal.pone.0196734 [doi]', 'PONE-D-17-43190 [pii]'], 'mdDOI': ['10.1371/journal.pone.0196734 [doi]', 'PONE-D-17-43190 [pii]'], 'mdSO': 'PLoS One. 2018 May 8;13(5):e0196734. doi: 10.1371/journal.pone.0196734. eCollection 2018.', 'mdLanguage': ['English']}

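import re

# Flatten the record into one "key=value" string, then scan it token by token.
dictString = ', '.join("{!s}={!r}".format(key, val) for (key, val) in attributes.items())
for each in dictString.split(' '):
    # re.match anchors at the start of the token, so only tokens that begin with NCT are reported.
    if re.match(r'NCT\d{8}', each):
        print(each.strip('.\','))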

However, PubMed/MEDLINE also contains 40 other types of clinical trial registration IDs, and I want to get these IDs as well. How can I do this more efficiently than writing 40 more regex statements?

Note: To clarify, I need to identify each ID and the registry that issued it (e.g. ClinicalTrials.gov for NCT01435343, or the Australian New Zealand Clinical Trials Registry for ACTRN12616000470493).

Upvotes: 1

Views: 251

Answers (1)

Arthur Dent

Reputation: 1952

I haven't looked at enough records to know whether the same pattern always applies, but if the IDs always follow the text "TRIAL REGISTRATION NUMBER:" inside HTML <h4> tags, you could parse the actual HTML document for <h4> tags containing this term, then take the text from the following paragraph in <p> tags. BeautifulSoup makes this relatively straightforward.
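To make the idea concrete, here is a rough sketch of the "find the <h4>, take the next <p>" step. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything; the same logic carries over to BeautifulSoup. The class name, the function name and the sample markup are all made up for illustration, and the <h4>/<p> layout is only my assumption about the page, not something I have verified.

```python
from html.parser import HTMLParser


class TrialRegistrationParser(HTMLParser):
    """Collect the text of the first <p> that follows a matching <h4>."""

    def __init__(self, heading="TRIAL REGISTRATION"):
        super().__init__()
        self.heading = heading.lower()
        self.in_h4 = False
        self.in_p = False
        self.armed = False   # True once a matching <h4> has just closed
        self.h4_text = []
        self.found = []      # one string per matching paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.in_h4 = True
            self.h4_text = []
        elif tag == "p" and self.armed:
            self.in_p = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False
            self.armed = self.heading in "".join(self.h4_text).lower()
        elif tag == "p" and self.in_p:
            self.in_p = False
            self.armed = False

    def handle_data(self, data):
        if self.in_h4:
            self.h4_text.append(data)
        elif self.in_p:
            self.found[-1] += data


def registration_text(html):
    """Return the stripped text of every paragraph under a matching heading."""
    parser = TrialRegistrationParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.found]


# Hypothetical markup, assumed to resemble the abstract page layout.
sample = """
<h4>CONCLUSIONS:</h4><p>Further study is needed.</p>
<h4>TRIAL REGISTRATION NUMBER:</h4><p>NCT02340000; ACTRN12616000470493</p>
"""
print(registration_text(sample))
```

The heading match is a case-insensitive substring test, so it would accept both "TRIAL REGISTRATION:" (as in your abstract) and "TRIAL REGISTRATION NUMBER:".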

But again, you've only shown one example, so I don't know whether it always follows this pattern. From there, the IDs appear to be semicolon-delimited, which is simple to split on.
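Once you have that semicolon-delimited text, one way to avoid 40 separate regex statements is a single lookup table that maps each registry to its ID pattern and is compiled into one combined regex. This is only a sketch: the two patterns are inferred from the example IDs in the question (NCT plus 8 digits, ACTRN plus 14 digits), the names REGISTRY_PATTERNS and label_ids are made up, and every other registry would need its own row that you would have to look up.

```python
import re

# Registry name -> ID pattern. Only the two registries named in the question
# are listed; the formats are inferred from the example IDs given there.
REGISTRY_PATTERNS = {
    "ClinicalTrials.gov": r"NCT\d{8}",
    "Australian New Zealand Clinical Trials Registry": r"ACTRN\d{14}",
}

# Build one alternation with a named group per registry (r0, r1, ...),
# so a single scan tells us which registry each hit belongs to.
_NAMES = list(REGISTRY_PATTERNS)
_COMBINED = re.compile(
    "|".join(
        rf"(?P<r{i}>\b{pattern}\b)"
        for i, pattern in enumerate(REGISTRY_PATTERNS.values())
    )
)


def label_ids(text):
    """Return (registry, trial_id) pairs found in a semicolon-delimited string."""
    pairs = []
    for chunk in text.split(";"):
        for match in _COMBINED.finditer(chunk):
            # lastgroup is the name of the group that matched, e.g. "r1".
            registry = _NAMES[int(match.lastgroup[1:])]
            pairs.append((registry, match.group()))
    return pairs


for registry, trial_id in label_ids("NCT02340000; ACTRN12616000470493"):
    print(trial_id, "->", registry)
```

Adding a registry then means adding one dictionary entry rather than another regex statement and another loop.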

Upvotes: 2
