Cödingers Cat

Reputation: 37

lxml errors when scraping a website's HTML for text data (tried several iterations)

I'm trying to get the text of the Congress House members from the https://www.congress.gov/members website. I'm very new at this. I followed a tutorial on YouTube and think I am very close.

Here is a snippet of the HTML I am trying to get. The text I want is shown in bold.

Picture of the HTML from website

Here is the code that I think gets me the closest (using Python 2.7 because of work constraints):

import requests, lxml
import lxml.html
#from bs4 import BeautifulSoup

html = requests.get('https://www.congress.gov/members?q=%7B%22congress%22%3A%22117%22%2C%22chamber%22%3A%22Senate%22%7D')
doc = lxml.html.fromstring(html.content)

house = doc.xpath('//div[@id="houseMemberNavigator"]')[0]

print(house)  # prints the div Element

members = house.xpath('.//select[@id="members-representatives"]/text()')
#returns ['\n        ', '                        ']

print(members)

I'm sure it's my syntax, but I have not been able to solve it.

Upvotes: 0

Views: 72

Answers (1)

It_is_Chris

Reputation: 14103

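Using BeautifulSoup (reusing the html response from the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html.text, 'lxml')
# The member names are the text of the <option> elements inside the <select>
[data.text for data in soup.find(id='members-representatives').select('option[value]')]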

['Find a Representative',
 'Adams, Alma S. [D-NC-12]',
 'Aderholt, Robert B. [R-AL-4]',
 'Aguilar, Pete [D-CA-31]',
 'Allen, Rick W. [R-GA-12]',
 'Allred, Colin Z. [D-TX-32]',
 'Amodei, Mark E. [R-NV-2]',
 'Armstrong, Kelly [R-ND]',
 'Arrington, Jodey C. [R-TX-19]',
 'Auchincloss, Jake [D-MA-4]',
 'Axne, Cynthia [D-IA-3]',
 'Babin, Brian [R-TX-36]',
  ...]
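
If you need to stay with lxml as in the question, here is a minimal sketch of the same idea. The XPath in the question ends in select/text(), which only returns the whitespace text nodes sitting directly inside the select element; the names are the text of its option children, so the path needs an extra /option step. The SAMPLE_HTML string, the option value attributes, and the option_texts helper are made up for illustration (the real page's markup may differ); only the element ids and the member names come from this thread.

```python
import lxml.html

# Hypothetical stand-in for the real page, so the example runs offline.
# The "value" attributes are invented; the ids match the question.
SAMPLE_HTML = """
<div id="houseMemberNavigator">
  <select id="members-representatives">
    <option value="">Find a Representative</option>
    <option value="member-1">Adams, Alma S. [D-NC-12]</option>
    <option value="member-2">Aderholt, Robert B. [R-AL-4]</option>
  </select>
</div>
"""


def option_texts(html_source, select_id):
    """Return the stripped text of every <option> under the given <select>."""
    doc = lxml.html.fromstring(html_source)
    # Step into the <option> children before asking for text();
    # select/text() alone yields only the whitespace between the options.
    xpath = '//select[@id="%s"]/option/text()' % select_id
    return [text.strip() for text in doc.xpath(xpath)]


members = option_texts(SAMPLE_HTML, "members-representatives")
print(members)
```

With the live page you would pass html.content from the question's requests.get call instead of SAMPLE_HTML. Like the BeautifulSoup version, this keeps the 'Find a Representative' placeholder as the first item, so slice it off with members[1:] if you only want names.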

Upvotes: 1
