Reputation: 75
I have been trying for hours to sort this out, without success.
Here is my script using Selenium WebDriver in Python, which tries to extract the title, date, and link of each item. I am able to extract the title and link, but I am stuck on extracting the date. Could someone please help me with this? I'd much appreciate your response.
import selenium.webdriver
import pandas as pd
frame=[]
url = "https://www.oric.gov.au/publications/media-releases"
driver = selenium.webdriver.Chrome("C:/Users/[Computer_Name]/Downloads/chromedriver.exe")
driver.get(url)
all_div = driver.find_elements_by_xpath('//div[contains(@class, "ui-accordion-content")]')
for div in all_div:
    all_items = div.find_elements_by_tag_name("a")
    for item in all_items:
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date =
        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })
dfs = pd.DataFrame(frame)
dfs.to_csv('myscraper.csv',index=False,encoding='utf-8-sig')
Here is the html I am interested in:
<div id="ui-accordion-1-panel-0" ...>
  <div class="views-field views-field-title">
    <span class="field-content">
      <a href="/publications/media-release/ngadju-corporation-emerges-special-administration-stronger">
        Ngadju corporation emerges from special administration stronger
      </a>
    </span>
  </div>
  <div class="views-field views-field-field-document-media-release-no">
    <div class="field-content"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-07-31T00:00:00+10:00">
      31 July 2020
    </span> (MR2021-06)</div>
  </div>
</div>
...
Upvotes: 0
Views: 843
Reputation:
OK, just search for the <span> with the property dc:date, save it in a WebElement dateElement, and take its text with dateElement.text. That's your date as a string.
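A minimal sketch of that suggestion, written in the same Selenium 3 style as the question (newer Selenium releases dropped the find_element_by_* helpers in favour of find_element(By.CSS_SELECTOR, ...)). The helper name date_from_row and the idea of passing in one row element at a time are assumptions for illustration, not part of the original post. Because .text only returns visible text, the sketch falls back to textContent in case the accordion panel is collapsed.

```python
def date_from_row(row):
    """Return the display date for one media-release row.

    `row` is assumed to be a Selenium WebElement that contains both the
    title link and the date <span> (hypothetical helper, Selenium 3 API).
    """
    # Attribute selector: the <span> that carries property="dc:date".
    date_element = row.find_element_by_css_selector('span[property="dc:date"]')
    # .text is empty for hidden elements, so fall back to textContent.
    text = date_element.text or date_element.get_attribute('textContent') or ''
    # Strip the newlines and indentation around e.g. "31 July 2020".
    return text.strip()
```

The function would be called once per row inside the scraping loop, with the result stored under the 'date' key.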
Upvotes: 1
Reputation:
I'd get all rows first, then read the title, link, and date from each row.
from pprint import pprint
import selenium.webdriver
frame = []
url = "https://www.oric.gov.au/publications/media-releases"
driver = selenium.webdriver.Chrome()
driver.get(url)
divs = driver.find_elements_by_css_selector('div.ui-accordion-content')
for div in divs:
    rows = div.find_elements_by_css_selector('div.views-row')
    for row in rows:
        item = row.find_element_by_tag_name('a')
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date = row.find_element_by_css_selector(
            'span.date-display-single').get_attribute('textContent')
        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })
driver.quit()
pprint(frame)
print(len(frame))
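As a possible refinement: the date span in the question's HTML also carries a machine-readable content attribute (2020-07-31T00:00:00+10:00). If a real date object is more useful than the display string, something like the following could parse it. parse_release_date is a hypothetical helper, and it assumes the attribute is always ISO 8601 as in the sample.

```python
from datetime import date, datetime


def parse_release_date(content_attr):
    """Convert the span's ISO 8601 `content` attribute to a date."""
    # fromisoformat accepts the "+10:00" offset form (Python 3.7+).
    return datetime.fromisoformat(content_attr).date()


# In the loop above, the input would come from something like:
#   row.find_element_by_css_selector(
#       'span.date-display-single').get_attribute('content')
print(parse_release_date("2020-07-31T00:00:00+10:00"))  # prints 2020-07-31
```

Storing a date rather than "31 July 2020" makes sorting and filtering the rows easier later on.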
Upvotes: 1