Python: html table content

Question

I am trying to scrape this website, but I keep getting an error when I try to print just the content of the table.

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://clinicaltrials.gov/show/NCT01718158').read())

print soup('table')[6].prettify()


for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    print tds[0].string,tds[1].string

IndexError                                Traceback (most recent call last)
 in ()
  1 for row in soup('table')[6].findAll('tr'):
  2     tds = row('td')
  3     print tds[0].string,tds[1].string
  4 

IndexError: list index out of range

Martijn Pieters · Accepted Answer

The table has a header row, with <th> header elements rather than <td> cells. Your code assumes there will always be <td> elements in each row, and that fails for the first row.
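To make the failure concrete, here is a minimal sketch that counts the <td> cells in each row. It assumes Python 3 and the bs4 package rather than the Python 2 code in the question, and SAMPLE_HTML is a hypothetical two-row stand-in for the real table, not markup copied from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table: one header row, one data row.
SAMPLE_HTML = """
<table>
  <tr><th>Header</th></tr>
  <tr><td>Responsible Party:</td><td>Bristol-Myers Squibb</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find("table").find_all("tr")

# Count the <td> cells in each row. The header row only holds <th>,
# so its count is 0 and indexing tds[0] on it raises IndexError.
td_counts = [len(row.find_all("td")) for row in rows]
print(td_counts)  # [0, 2]
```

Any row whose count is below 2 cannot be indexed with tds[0] and tds[1], which is exactly what the traceback reports.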

You could skip any row that doesn't have enough <td> elements:

for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    if len(tds) < 2:
        continue
    print tds[0].string, tds[1].string

With that check in place, the output is:

>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, tds[1].string
... 
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: None
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: None

The last row contains text interspersed with elements, which is why .string returns None there. You could use the element.strings generator to extract all the strings and join them with newlines; I'd strip each string first, though:

>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, '\n'.join(filter(unicode.strip, tds[1].strings))
... 
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: NCT01718158
History of Changes
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: United States: Institutional Review Board
United States: Food and Drug Administration
Argentina: Administracion Nacional de Medicamentos, Alimentos y Tecnologia Medica
France: Afssaps - Agence française de sécurité sanitaire des produits de santé (Saint-Denis)
Germany: Federal Institute for Drugs and Medical Devices
Germany: Ministry of Health
Israel: Israeli Health Ministry Pharmaceutical Administration
Israel: Ministry of Health
Italy: Ministry of Health
Italy: National Bioethics Committee
Italy: National Institute of Health
Italy: National Monitoring Centre for Clinical Trials - Ministry of Health
Italy: The Italian Medicines Agency
Japan: Pharmaceuticals and Medical Devices Agency
Japan: Ministry of Health, Labor and Welfare
Korea: Food and Drug Administration
Poland: National Institute of Medicines
Poland: Ministry of Health
Poland: Ministry of Science and Higher Education
Poland: Office for Registration of Medicinal Products, Medical Devices and Biocidal Products
Russia: FSI Scientific Center of Expertise of Medical Application
Russia: Ethics Committee
Russia: Ministry of Health of the Russian Federation
Spain: Spanish Agency of Medicines
Taiwan: Department of Health
Taiwan: National Bureau of Controlled Drugs
United Kingdom: Medicines and Healthcare Products Regulatory Agency
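
For readers on Python 3, the same idea could be sketched roughly as below. This is not the code from the answer above: it assumes the bs4 package, swaps filter(unicode.strip, ...) for bs4's stripped_strings generator (which strips each string and skips whitespace-only ones), and uses a hypothetical helper table_rows and a hypothetical SAMPLE_HTML in place of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real table: a header row, a cell with
# links, and a cell whose text is split up by <br/> elements.
SAMPLE_HTML = """
<table>
  <tr><th>Header</th></tr>
  <tr><td>ClinicalTrials.gov Identifier:</td>
      <td><a href="#">NCT01718158</a> <a href="#">History of Changes</a></td></tr>
  <tr><td>Health Authority:</td>
      <td>United States: Institutional Review Board<br/>
          United States: Food and Drug Administration</td></tr>
</table>
"""


def table_rows(table):
    """Yield (label, value) pairs for rows with at least two <td> cells."""
    for row in table.find_all("tr"):
        tds = row.find_all("td")
        if len(tds) < 2:
            continue  # header row, or a row without a label/value pair
        label = " ".join(tds[0].stripped_strings)
        # One line per text fragment, each already stripped of whitespace.
        value = "\n".join(tds[1].stripped_strings)
        yield label, value


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pairs = list(table_rows(soup.find("table")))
for label, value in pairs:
    print(label, value)
```

Returning pairs from a generator instead of printing inside the loop keeps the parsing separate from the output, so the same rows can be written to a file or collected into a dict.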
