Reputation: 642
So I am working on web-scraping https://data.bls.gov/cgi-bin/surveymost?bls and was able to figure out how to crawl through the clicks to get to a table.
The selection I am practicing on: after you select the checkbox associated with "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation, you then select "Retrieve data".
Once those two steps are processed, a table shows. This is the table I am trying to scrape.
Below is the code that I have as of right now.
Note that you have to put your own path to your browser driver where I have put <browser driver>.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import numpy as np
import requests
import lxml.html as lh
from selenium import webdriver
url = "https://data.bls.gov/cgi-bin/surveymost?bls"
ChromeSource = r"<browser driver>"
# Open up a Chrome browser and navigate to web page.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless') # will run without opening browser.
driver = webdriver.Chrome(ChromeSource, options=options)
driver.get(url)
driver.find_element_by_xpath("//input[@type='checkbox' and @value = 'CIU1010000000000A']").click()
driver.find_element_by_xpath("//input[@type='Submit' and @value = 'Retrieve data']").click()
i = 2
def myTEST(i):
    xpath = '//*[@id="col' + str(i) + '"]'
    TEST = driver.find_elements_by_xpath(xpath)
    num_page_items = len(TEST)
    for i in range(num_page_items):
        print(TEST[i].text)

myTEST(i)
# Clean up (close browser once completed task).
driver.close()
Right now this is only looking at the headers. I would like to get the table content as well.
If I set i = 0, it produces "Year". With i = 1, it produces "Period". But if I select i = 2, I get two values, because "Estimated Value" and "Standard Error" share the same col2 id.
I tried to think of a way to work around this and can't seem to get anything that I have researched to work.
In essence, it would be better to start at the point where I am done clicking and am at the table of interest, then look at the XPath of the header row and pull in the text of all of its sub <th>'s.
<tr>
    <th id="col0"> Year </th>
    <th id="col1"> Period </th>
    <th id="col2">Estimated Value</th>
    <th id="col2">Standard Error</th>
</tr>
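One way around the duplicated col2 id (a minimal sketch using BeautifulSoup on a hard-coded copy of the header row above; in the real script you would feed it driver.page_source instead) is to select every <th> in the row rather than selecting by id:

```python
from bs4 import BeautifulSoup

# Hard-coded sample of the header row shown above, for illustration only.
html = """
<table>
  <tr>
    <th id="col0"> Year </th>
    <th id="col1"> Period </th>
    <th id="col2">Estimated Value</th>
    <th id="col2">Standard Error</th>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Grab every <th> in document order, ignoring the (duplicated) id attributes.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
print(headers)  # ['Year', 'Period', 'Estimated Value', 'Standard Error']
```

The same idea works for the body: select the <td> cells row by row instead of relying on ids that repeat.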
I am not sure how to do that. I also tried to loop through the {i}, but obviously two headers sharing the same id causes an issue.
Once I am able to get the headers, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or whether there is a simpler way to do all of this. I am learning, and this is my first attempt using the selenium library for clicks. I just want to get it to work so I can try it again on a different table and make it as automated and reusable (with tweaking) as possible.
Upvotes: 1
Views: 526
Reputation: 11505
Actually you don't need selenium. You can just track the POST form data and apply the same within your own POST request. Then you can load the table easily using Pandas.
import requests
import pandas as pd
data = {
"series_id": "CIU1010000000000A",
"survey": "bls"
}
def main(url):
    r = requests.post(url, data=data)
    df = pd.read_html(r.content)[1]
    print(df)

main("https://data.bls.gov/cgi-bin/surveymost")
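As an aside, pandas.read_html() returns a list with one DataFrame per <table> it finds in the page, which is why the code above indexes [1]. A self-contained sketch with made-up sample HTML:

```python
from io import StringIO

import pandas as pd

# Two tables in one page; read_html() returns one DataFrame per <table>.
html = """
<table><tr><th>other</th></tr><tr><td>not the one we want</td></tr></table>
<table>
  <tr><th>Year</th><th>Period</th><th>Estimated Value</th><th>Standard Error</th></tr>
  <tr><td>2020</td><td>Q1</td><td>140.6</td><td>0.2</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # StringIO avoids a deprecation warning on literal HTML
print(len(tables))       # 2
df = tables[1]           # pick the table of interest by position
print(list(df.columns))  # ['Year', 'Period', 'Estimated Value', 'Standard Error']
```

If the page layout changes, print all the tables once to confirm which index you need.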
Explanation:
Open the page and select the checkbox for Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A.
Open your browser's Network Monitor section: press Ctrl + Shift + E (Command + Option + E on a Mac).
Click "Retrieve data" and you will find a POST request has been made.
Navigate to the Params tab to see the form data that was sent.
Now you can make the same POST request. And since the table is presented within the HTML source and is not loaded via JavaScript, you can parse it with bs4 or read it in a nice format using pandas.read_html().
Note: You can read the table as long as it's not loaded via JavaScript. Otherwise you can try to track the XHR request (check the previous answer), or you can use selenium or requests_html to render the JS, since requests is an HTTP library that can't render it for you.
Upvotes: 4