Reputation: 654
I am working on a small Python function to scrape data from ClinicalTrials.gov. From each study record, I want to scrape the conditions that the study is targeting. For example, for study record NCT00550550 I want the following:
conditions = ['Rhinoconjunctivitis', 'Rhinitis', 'Conjunctivitis', 'Allergy']
However, the number of conditions varies from one study record to the next. I have written the following script, which gets the data:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://clinicaltrials.gov/ct2/show/study/NCT00550550')
soup = BeautifulSoup(page.text, 'html.parser')
studyDesign = soup.find_all(headers='studyInfoColData')
# Every <span> in the first element with class "data_table"
condition = soup.find(attrs={'class':'data_table'}).find_all('span')
for each in condition:
    print(each.text.encode('utf-8').strip())
This prints:
b'Condition or disease'
b'Intervention/treatment'
b'Phase'
b'Rhinoconjunctivitis'
b'Rhinitis'
b'Conjunctivitis'
b'Allergy'
b'Drug: Placebo'
b'Biological: SCH 697243'
b'Drug: Loratadine Syrup 1 mg/mL Rescue Treatment'
b'Drug: Loratadine 10 mg Rescue Treatment'
b'Drug: Olopatadine 0.1% Rescue Treatment'
b'Drug: Mometasone furoate 50 mcg Rescue Treatment'
b'Drug: Albuterol 108 mcg Rescue Treatment'
b'Drug: Fluticasone 44 mcg Rescue Treatment'
b'Drug: Prednisone 5 mg Rescue Treatment'
b'Phase 3'
How can I now only get the condition without the intervention/treatment info?
Upvotes: 2
Views: 3418
Reputation: 1
The easiest way to get data from ClinicalTrials.gov is to create an account at https://aact.ctti-clinicaltrials.org/connect and use those credentials to connect to the AACT PostgreSQL database, which stores this data.
Among other things, it has the data you are looking for in the ctgov.conditions table:
select name from ctgov.conditions where nct_id = 'NCT00550550';
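If you want to run that query from Python, here is a rough sketch. The helper names (fetch_conditions, connect_to_aact) are made up for illustration, and the host, port and database name are assumptions that you should check against the connection details AACT shows you after signing up. The connection helper assumes the third-party psycopg2 package. The last few lines only exercise the query logic against an in-memory SQLite stand-in with hypothetical rows, so the snippet runs without credentials.

```python
import sqlite3  # only used for the offline check at the bottom


def fetch_conditions(conn, nct_id, placeholder="%s"):
    """Return condition names for one trial from ctgov.conditions.

    `conn` is any DB-API connection; `placeholder` is its parameter marker
    ("%s" for psycopg2, "?" for sqlite3).
    """
    query = "select name from ctgov.conditions where nct_id = " + placeholder
    cur = conn.cursor()
    try:
        # The trial id is passed as a bound parameter, not pasted into the SQL
        cur.execute(query, (nct_id,))
        return [row[0] for row in cur.fetchall()]
    finally:
        cur.close()


def connect_to_aact(user, password):
    """Open a connection to AACT (needs the third-party psycopg2 package).

    Host, port and database name below are assumptions; verify them against
    the connection details on the AACT site.
    """
    import psycopg2  # imported lazily so the rest runs without it

    return psycopg2.connect(
        host="aact-db.ctti-clinicaltrials.org",  # assumed
        port=5432,  # assumed
        dbname="aact",  # assumed
        user=user,
        password=password,
    )


# Offline check against an in-memory SQLite stand-in (hypothetical rows).
demo = sqlite3.connect(":memory:")
demo.execute("attach database ':memory:' as ctgov")
demo.execute("create table ctgov.conditions (nct_id text, name text)")
demo.executemany(
    "insert into ctgov.conditions values (?, ?)",
    [
        ("NCT00550550", "Rhinitis"),
        ("NCT00550550", "Allergy"),
        ("NCT02656888", "Some other condition"),
    ],
)
demo_names = fetch_conditions(demo, "NCT00550550", placeholder="?")
print(demo_names)
```

With real credentials, something like fetch_conditions(connect_to_aact(user, password), 'NCT00550550') should run the same query against AACT itself.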
Upvotes: 0
Reputation: 139
Maybe this code will help.
import requests
from bs4 import BeautifulSoup
# url = "https://clinicaltrials.gov/ct2/show/NCT02656888"
url = "https://clinicaltrials.gov/ct2/show/study/NCT00550550"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Collect the <td> cells of every table with class "data_table"
tables = soup.find_all("table", class_="data_table")
tds = [table.find_all("td") for table in tables]
# The first cell of the first table holds the conditions, one per line
conditions = [condition for condition in tds[0][0].get_text().split("\n") if condition != ""]
print(conditions)
Upvotes: 1
Reputation: 45443
You can just use the first table with class data_table and extract the span elements in its first td:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://clinicaltrials.gov/ct2/show/study/NCT00550550')
soup = BeautifulSoup(page.text, 'html.parser')
studyDesign = soup.find("table", {"class" : "data_table"}).find('td')
conditions = [ t.text.strip() for t in studyDesign.find_all('span') ]
print(conditions)
which gives:
[u'Rhinoconjunctivitis', u'Rhinitis', u'Conjunctivitis', u'Allergy']
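Since the question mentions wrapping this in a small function, here is a minimal sketch of the same idea that also returns an empty list when the table or cell is missing. The function name extract_conditions and the sample markup are hypothetical; the markup is a simplified stand-in for a study record page, not the real page source.

```python
from bs4 import BeautifulSoup


def extract_conditions(html):
    """Return the text of each <span> in the first <td> of the first data_table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data_table")
    if table is None:
        return []
    first_cell = table.find("td")  # header cells are <th>, so they are skipped
    if first_cell is None:
        return []
    return [span.get_text(strip=True) for span in first_cell.find_all("span")]


# Hypothetical, simplified markup standing in for a study record page.
sample_html = """
<table class="data_table">
  <tr>
    <th><span>Condition or disease</span></th>
    <th><span>Intervention/treatment</span></th>
    <th><span>Phase</span></th>
  </tr>
  <tr>
    <td><span>Rhinoconjunctivitis</span><span>Rhinitis</span></td>
    <td><span>Drug: Placebo</span></td>
    <td><span>Phase 3</span></td>
  </tr>
</table>
"""

conditions = extract_conditions(sample_html)
print(conditions)
```

For a live page you would pass requests.get(url).text to the function instead of the sample string.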
Upvotes: 1