Diop Chopra

Reputation: 319

Extracting element from table using selenium web scraping

Using Selenium, I am trying to scrape a table from a website. However, the data ends up compressed into a single column rather than two separate columns, Date and Value. Any help would be greatly appreciated. (Now amended to include further improvements.)

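import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumption: any WebDriver would do here
driver.get("https://www.multpl.com/shiller-pe/table/by-year/")

table_trs = driver.find_elements(By.XPATH, '//table[@id="datatable"]/tbody/tr')

# Each row's .text is the whole row as one string, hence the single column
value_list = []
for row in table_trs:
    value_list.append(row.text)

print(value_list)
df = pd.DataFrame(value_list)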

The table I am trying to scrape appears on the website as follows....

table

...and the section of HTML associated with it:

html tree

Upvotes: 0

Views: 4385

Answers (2)

HedgeHog

Reputation: 25048

Note: This answer focuses on correct XPath usage and is based only on your screenshot. Improving your question by posting code and examples as text would generate more specific answers.

To get the <tr>s of the table by XPath, change it to //table[@id="datatable"]/tbody/tr:

from selenium.webdriver.common.by import By

table_trs = driver.find_elements(By.XPATH, '//table[@id="datatable"]/tbody/tr')

EDIT

Based on your improvements, you can use pandas.read_html() to convert the table directly into a DataFrame. You only have to rename the columns, because there are two <span> tags in the <th> that would otherwise lead to the column name "Value Value":

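from io import StringIO
import pandas as pd

driver.get("https://www.multpl.com/shiller-pe/table/by-year/")

# Recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(driver.page_source))[0]
df.columns = ['Date', 'Value']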

As an alternative, you could iterate over the rows like this:

driver.get("https://www.multpl.com/shiller-pe/table/by-year/")

table_trs = driver.find_elements(By.XPATH, '//table[@id="datatable"]/tbody/tr')
value_list = []
for row in table_trs[1:]:  # skip the first row
    # Look up the row's cells once instead of once per column
    cells = row.find_elements(By.TAG_NAME, "td")
    value_list.append({
        'Date': cells[0].text,
        'Value': cells[1].text
    })

df = pd.DataFrame(value_list)

Output

Date         Value
Feb 4, 2022  37.18
Jan 1, 2022  39.63
Jan 1, 2021  34.51
Jan 1, 2020  30.99
Jan 1, 2019  28.38
Jan 1, 2018  33.31
Jan 1, 2017  28.06
Jan 1, 2016  24.21
Jan 1, 2015  26.49
...          ...
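
If you would rather keep your original approach of reading each row's .text, the combined string can be split on its last whitespace. This is only a minimal sketch: split_row is a hypothetical helper, the row strings are made-up stand-ins shaped like the output above, and it assumes the value itself never contains a space.

```python
def split_row(text):
    """Split a combined row string into (date, value) on the last whitespace."""
    date, value = text.rsplit(maxsplit=1)
    return date, value


# Hypothetical row strings, standing in for [row.text for row in table_trs]
row_texts = ["Feb 4, 2022 37.18", "Jan 1, 2022 39.63"]

# One dict per row, ready to be passed to pd.DataFrame(records)
records = [dict(zip(("Date", "Value"), split_row(t))) for t in row_texts]
```

Iterating over the <td> cells as shown earlier is probably more robust, since it does not depend on how the text happens to be spaced.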

Upvotes: 3

Gaj Julije

Reputation: 2183

Something is missing here, or you used the wrong XPath. Based on the picture, a valid XPath would be "//div[@id="datatable"]/tbody/tr", but that only gives you the rows. You can iterate over the row and column numbers with something like //div[@id="datatable"]/tbody/tr[i]/td[j] and then get the text from each element.
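
A rough sketch of that index-based iteration follows. To keep it runnable without a browser, it uses Python's built-in xml.etree.ElementTree on a hypothetical, simplified copy of the markup instead of Selenium, and it assumes the table element from the question's code rather than a div. The cell_xpath helper and the SAMPLE markup are illustrative only. With Selenium, the same expressions (starting with // instead of .//) would go to driver.find_element(By.XPATH, ...).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page's table markup
SAMPLE = """
<html><body>
<table id="datatable">
  <tbody>
    <tr><th>Date</th><th>Value</th></tr>
    <tr><td>Feb 4, 2022</td><td>37.18</td></tr>
    <tr><td>Jan 1, 2022</td><td>39.63</td></tr>
  </tbody>
</table>
</body></html>
"""


def cell_xpath(i, j):
    """XPath for the j-th <td> of the i-th <tr>; XPath positions start at 1."""
    return f".//table[@id='datatable']/tbody/tr[{i}]/td[{j}]"


root = ET.fromstring(SAMPLE)
n_rows = len(root.findall(".//table[@id='datatable']/tbody/tr"))

records = []
for i in range(2, n_rows + 1):  # row 1 is the header row (<th> cells)
    records.append({
        "Date": root.find(cell_xpath(i, 1)).text,
        "Value": root.find(cell_xpath(i, 2)).text,
    })
```

Note that each tr[i]/td[j] lookup is a separate query, so on a long table with a real driver this is likely slower than fetching all rows once and reading their cells.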

Upvotes: 1
