jdoe

Reputation: 654

Scrape data from clinicalTrials.gov

I am working on a small Python function to scrape data from clinicalTrials.gov. From each Study Record, I wish to scrape the conditions that the study is targeting. For example, for this study record I want the following:

conditions = ['Rhinoconjunctivitis', 'Rhinitis', 'Conjunctivitis', 'Allergy']

However, in each study record, there are different numbers of conditions. I have written the following script which gets the data:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://clinicaltrials.gov/ct2/show/study/NCT00550550')
soup = BeautifulSoup(page.text, 'html.parser')
studyDesign = soup.find_all(headers='studyInfoColData')
# Every span in the first data_table: column headers, conditions, interventions and phase
condition = soup.find(attrs={'class':'data_table'}).find_all('span')
for each in condition:
    print(each.text.encode('utf-8').strip())

It prints every span in the table, like so:

b'Condition or disease'
b'Intervention/treatment'
b'Phase'
b'Rhinoconjunctivitis'
b'Rhinitis'
b'Conjunctivitis'
b'Allergy'
b'Drug: Placebo'
b'Biological: SCH 697243'
b'Drug: Loratadine Syrup 1 mg/mL Rescue Treatment'
b'Drug: Loratadine 10 mg Rescue Treatment'
b'Drug: Olopatadine 0.1% Rescue Treatment'
b'Drug: Mometasone furoate 50 mcg Rescue Treatment'
b'Drug: Albuterol 108 mcg Rescue Treatment'
b'Drug: Fluticasone 44 mcg Rescue Treatment'
b'Drug: Prednisone 5 mg Rescue Treatment'
b'Phase 3'

How can I get only the conditions, without the intervention/treatment info?

Upvotes: 2

Views: 3418

Answers (3)

Rohit

Reputation: 1

The easiest way to scrape clinicaltrials.gov is to create an account on https://aact.ctti-clinicaltrials.org/connect and use the credentials to connect to the AACT PostgreSQL database that stores this data.

Among other things, the data you are looking for is in the ctgov.conditions table.

select name from ctgov.conditions where nct_id = 'NCT00550550';
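
To run the same query from Python, a parameterised query keeps the trial ID out of the SQL string. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for AACT (the ctgov schema is emulated with ATTACH, and the rows are copied from the question rather than fetched). fetch_conditions is a hypothetical helper name. Against the real PostgreSQL database you would pass a connection from a driver such as psycopg2 and use its %s placeholder instead of ?.

```python
import sqlite3


def fetch_conditions(conn, nct_id, placeholder="?"):
    """Return the condition names recorded for one trial (hypothetical helper)."""
    # Only the placeholder token is interpolated; the trial ID is bound by the driver.
    query = f"select name from ctgov.conditions where nct_id = {placeholder}"
    cur = conn.cursor()
    cur.execute(query, (nct_id,))
    return [row[0] for row in cur.fetchall()]


# Stand-in database: SQLite in memory, with an attached database named "ctgov"
# so that the schema-qualified table name from the answer resolves.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS ctgov")
conn.execute("CREATE TABLE ctgov.conditions (nct_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO ctgov.conditions VALUES (?, ?)",
    [
        ("NCT00550550", "Rhinoconjunctivitis"),
        ("NCT00550550", "Rhinitis"),
        ("NCT00550550", "Conjunctivitis"),
        ("NCT00550550", "Allergy"),
        ("NCT00000000", "Placeholder condition"),  # made-up row for another trial
    ],
)

conditions = fetch_conditions(conn, "NCT00550550")
print(conditions)
```

The query has no ORDER BY, so the order of the returned names should not be relied on.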

Upvotes: 0

Tomasz Wiśniewski

Reputation: 139

Maybe this code will help.

import requests
from bs4 import BeautifulSoup

#url = "https://clinicaltrials.gov/ct2/show/NCT02656888"
url = "https://clinicaltrials.gov/ct2/show/study/NCT00550550"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all("table", class_="data_table")

# Collect the <td> cells of each data_table; the first cell of the first table holds the conditions.
tds = [t.find_all("td") for t in table]
conditions = [line.strip() for line in tds[0][0].get_text().split("\n") if line.strip()]

print(conditions)

Upvotes: 1

Bertrand Martel

Reputation: 45443

You can just use the first table with class data_table and extract the span elements in its first td:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://clinicaltrials.gov/ct2/show/study/NCT00550550')
soup = BeautifulSoup(page.text, 'html.parser')
studyDesign = soup.find("table", {"class" : "data_table"}).find('td')
conditions = [ t.text.strip() for t in studyDesign.find_all('span') ]
print(conditions)

which gives:

[u'Rhinoconjunctivitis', u'Rhinitis', u'Conjunctivitis', u'Allergy']
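
If you want to check the extraction logic without hitting the site or installing BeautifulSoup, the same idea (take the first td of the first data_table and collect its span texts) can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code above; the class name FirstCellSpans and the HTML snippet are made up for illustration and only imitate the structure described in the question. Nested spans are not handled.

```python
from html.parser import HTMLParser


class FirstCellSpans(HTMLParser):
    """Collect span texts from the first <td> of the first table.data_table."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._state = "seek_table"  # seek_table -> seek_td -> in_td -> done
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._state == "seek_table" and tag == "table" and "data_table" in classes:
            self._state = "seek_td"
        elif self._state == "seek_td" and tag == "td":
            self._state = "in_td"
        elif self._state == "in_td" and tag == "span":
            self._in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "td" and self._state == "in_td":
            self._state = "done"  # ignore every later cell

    def handle_data(self, data):
        if self._state == "in_td" and self._in_span:
            self.spans[-1] += data


# Made-up snippet imitating the page layout: header row, then conditions and interventions.
sample_html = """
<table class="data_table">
  <tr><th><span>Condition or disease</span></th><th><span>Intervention/treatment</span></th></tr>
  <tr>
    <td><span>Rhinoconjunctivitis</span> <span> Rhinitis </span></td>
    <td><span>Drug: Placebo</span></td>
  </tr>
</table>
"""

parser = FirstCellSpans()
parser.feed(sample_html)
conditions = [text.strip() for text in parser.spans]
print(conditions)
```

For the made-up snippet this keeps the two condition names and skips both the header row and the intervention cell.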

Upvotes: 1
