Mango

Reputation: 17

Scraping Table from Website with Selenium Returning Empty DataFrame

I have just started learning web scraping and am trying to extract data from the 'Holdings' table at https://www.ishares.com/us/products/268752/ishares-global-reit-etf

First, I used pandas, but it returned an empty DataFrame. I later found out that this table is dynamic and that I need to use Selenium. However, that also returns an empty DataFrame. Could anyone help me with this, please? I would really appreciate it.

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Instantiate options
options = webdriver.ChromeOptions()
options.headless = True

# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver',options=options)
wd.get(site)

# Load the HTML page
html = wd.page_source

# Extract data with pandas
df = pd.read_html(html)
table = df[6]

Upvotes: 1

Views: 339

Answers (1)

undetected Selenium

Reputation: 193108

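To extract the data from the Holdings table of the iShares Global REIT ETF webpage, you need to wait until the table has actually been rendered before reading it. Induce WebDriverWait for visibility_of_element_located() on the table, take its outerHTML, and pass that to read_html() from Pandas to get a DataFrame. The locator strategies used are shown in the code block.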
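As background on the explicit wait mentioned above, here is a minimal sketch of the polling idea in plain Python. It is not Selenium's implementation: `wait_until`, `WaitTimeout`, and `table_html_if_rendered` are hypothetical names made up for illustration, and the "page" is simulated by a counter. The point is only that the condition is checked repeatedly until it succeeds or a timeout expires, instead of reading the page once right after `get()`.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a page whose table only appears on the third check
attempts = {"count": 0}


def table_html_if_rendered():
    attempts["count"] += 1
    return "<table>...</table>" if attempts["count"] >= 3 else None


html = wait_until(table_html_if_rendered, timeout=2.0, poll_interval=0.01)
print(html)
```

Reading `page_source` immediately after `get()` corresponds to calling the condition once and giving up, which is likely why the table was missing in the question.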

Code Block:

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Create the driver (assumes Selenium 4.6+, which locates chromedriver automatically)
wd = webdriver.Chrome(service=Service(), options=Options())
wd.get("https://www.ishares.com/us/products/268752/ishares-global-reit-etf")
# Accept the cookie banner so it does not cover the page
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
# Scroll the Holdings heading into view
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
# Wait until the holdings table is visible, then grab its HTML
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
# read_html() returns a list of DataFrames; StringIO avoids the literal-string deprecation warning in pandas 2.1+
df = pd.read_html(StringIO(data))
# df = pd.read_html(StringIO(data), flavor='html5lib')
print(df)
wd.quit()

Console Output:

[  Ticker                                Name       Sector Asset Class  ...      CUSIP          ISIN    SEDOL  Accrual Date
0    PLD                   PROLOGIS REIT INC  Real Estate      Equity  ...  74340W103  US74340W1036  B44WZD7             -
1   EQIX                    EQUINIX REIT INC  Real Estate      Equity  ...  29444U700  US29444U7000  BVLZX12             -
2    PSA                 PUBLIC STORAGE REIT  Real Estate      Equity  ...  74460D109  US74460D1090  2852533             -
3    SPG       SIMON PROPERTY GROUP REIT INC  Real Estate      Equity  ...  828806109  US8288061091  2812452             -
4    DLR       DIGITAL REALTY TRUST REIT INC  Real Estate      Equity  ...  253868103  US2538681030  B03GQS4             -
5      O             REALTY INCOME REIT CORP  Real Estate      Equity  ...  756109104  US7561091049  2724193             -
6   WELL                       WELLTOWER INC  Real Estate      Equity  ...  95040Q104  US95040Q1040  BYVYHH4             -
7    AVB      AVALONBAY COMMUNITIES REIT INC  Real Estate      Equity  ...  053484101  US0534841012  2131179             -
8    ARE  ALEXANDRIA REAL ESTATE EQUITIES RE  Real Estate      Equity  ...  015271109  US0152711091  2009210             -
9    EQR             EQUITY RESIDENTIAL REIT  Real Estate      Equity  ...  29476L107  US29476L1070  2319157             -

[10 rows x 12 columns]]
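
Note that the output is wrapped in square brackets because read_html() returns a list of DataFrames, not a single DataFrame. A small sketch of picking the table out of that list follows. The HTML string is a cut-down stand-in built from two rows of the output above, not the real page markup, and the variable names are only illustrative. It assumes pandas plus an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Cut-down stand-in for the table's outerHTML (two rows taken from the output above)
sample_html = """
<table>
  <thead><tr><th>Ticker</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>PLD</td><td>PROLOGIS REIT INC</td></tr>
    <tr><td>EQIX</td><td>EQUINIX REIT INC</td></tr>
  </tbody>
</table>
"""

tables = pd.read_html(StringIO(sample_html))  # always a list, one DataFrame per <table>
holdings = tables[0]                          # only one <table> was passed in, so index 0

print(holdings.shape)
print(holdings["Ticker"].tolist())
```

Because only the outerHTML of the one table is passed in, index 0 is the holdings table, so there is no need to guess an index such as df[6] against the whole page.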

Upvotes: 1
