Kaitlin

Reputation: 59

Web scraping with BeautifulSoup in Python

I am currently trying to scrape some information from the following link:

http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument

I would like to scrape some of the information in the table using BeautifulSoup in Python. Ideally, I would like to extract the "Grupo Parlamentario," "Título," "Sumilla," and "Autores" fields from the table as separate items.

So far I've developed the following code using BeautifulSoup:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find('table', {'bordercolor' : '#6583A0'})
contents = []
summary = []
authors = []
contents.append(table.findAll('font'))
authors.append(table.findAll('a'))

What I'm struggling with is that the code for the authors only returns the first author in the list, and I need all of them. This seems odd to me because, looking at the HTML for the page, every author in the list is wrapped in an '<a href=...>' tag, so I would expect table.findAll('a') to return all of them.

Finally, I'm just dumping the rest of the very messy HTML (title, summary, parliamentary group) together into contents. I'm fairly new to HTML and web scraping, so I may be missing something: is there a way to pull these items out and store them individually (i.e., just the title in one object, just the summary in another, etc.)? I'm having a tough time identifying unique tags for them in the page's source. Or is this something I should just clean and parse after scraping?

Upvotes: 1

Views: 86

Answers (1)

kederrac

Reputation: 17322

To get the authors, you can use:

soup.find('input', {'name': 'NomCongre'})['value']

Output:

'Santa María Calderón  Luis,Alva Castro  Luis,Armas Vela  Carlos,Cabanillas Bustamante  Mercedes,Carrasco Távara  José,De la Mata Fernández  Judith,De La Puente Haya  Elvira,Del Castillo Gálvez  Jorge,Delgado Nuñez Del Arco  José,Gasco Bravo  Luis,Gonzales Posada  Eyzaguirre  Luis,León Flores  Rosa Marina,Noriega Toledo  Víctor,Pastor Valdivieso  Aurelio,Peralta Cruz  Jonhy,Zumaeta Flores  César'
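Since that value comes back as one comma-separated string, here is a minimal sketch of turning it into a list with one author per entry. The helper name split_authors is hypothetical, and the sample string is just the first three names from the output above; the sketch assumes author names never contain commas themselves.

```python
# First three names from the output above, used as sample input.
raw_authors = "Santa María Calderón  Luis,Alva Castro  Luis,Armas Vela  Carlos"


def split_authors(raw):
    """Split a comma-separated author string into a list of clean names.

    Hypothetical helper: it assumes no author name contains a comma.
    Runs of whitespace inside a name (such as the double space before
    the given name) are collapsed to a single space.
    """
    return [" ".join(name.split()) for name in raw.split(",") if name.strip()]


authors = split_authors(raw_authors)
print(authors)
# ['Santa María Calderón Luis', 'Alva Castro Luis', 'Armas Vela Carlos']
```

With the real page you would pass in the string returned by soup.find('input', {'name': 'NomCongre'})['value'] instead of the sample.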

To scrape Grupo Parlamentario:

table.find_all('td', {'width': 446})[1].text

Output:

'Célula Parlamentaria Aprista'

To scrape Título:

table.find_all('td', {'width': 446})[2].text

Output:

'IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA '

To scrape Sumilla:

table.find_all('td', {'width': 446})[3].text

Output:

'  Propone la aprobación de una Ley Interpretativa del Texto Original del Numeral 1 del Apéndice II del Decreto Legislativo N°821,modificatorio del Texto Vigente del Numeral 1 del Apéndice II del Texto Único Ordenado de la Ley del Impuesto General a las Ventas y Selectivo al Consumo,aprobado por Decreto Supremo N°054-99-EF. '
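To store each item separately, as the question asks, the lookups above can be gathered into a single dictionary. The following is only a sketch: extract_fields and SAMPLE_HTML are hypothetical names, and SAMPLE_HTML is a tiny made-up stand-in that mimics just the parts of the page these lookups touch (a table with bordercolor="#6583A0", td cells with width="446", and an input named NomCongre), so it runs without a network request. The real page's markup is larger and may differ.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page; only the structure the lookups
# rely on is reproduced, and the texts are shortened.
SAMPLE_HTML = """
<table bordercolor="#6583A0">
  <tr><td width="446">(first cell, not used here)</td></tr>
  <tr><td width="446">Célula Parlamentaria Aprista</td></tr>
  <tr><td width="446">IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA </td></tr>
  <tr><td width="446">  Propone la aprobación de una Ley Interpretativa. </td></tr>
</table>
<input name="NomCongre" value="Santa María Calderón  Luis,Alva Castro  Luis">
"""


def extract_fields(html):
    """Return the four fields as a dict, using the same lookups as above."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"bordercolor": "#6583A0"})
    cells = table.find_all("td", {"width": "446"})
    raw_authors = soup.find("input", {"name": "NomCongre"})["value"]
    return {
        # get_text(strip=True) removes the leading/trailing whitespace
        # visible in the raw outputs above.
        "grupo_parlamentario": cells[1].get_text(strip=True),
        "titulo": cells[2].get_text(strip=True),
        "sumilla": cells[3].get_text(strip=True),
        # One list entry per author, with inner whitespace collapsed.
        "autores": [" ".join(a.split()) for a in raw_authors.split(",") if a.strip()],
    }


record = extract_fields(SAMPLE_HTML)
print(record["titulo"])
print(record["autores"])
```

For the live page, you would pass page.text from the requests call in the question instead of SAMPLE_HTML. Because the cells are picked by position, the indices would need rechecking if the page layout changes.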

Upvotes: 1
