Reputation: 1
For a small project I need to extract insolvency announcements from the following website: https://neu.insolvenzbekanntmachungen.de/ap/suche.jsf After entering today's date, selecting "Eröffnungen" from the "Gegenstand der Veröffentlichung" dropdown menu, and clicking "search" at the bottom, I need the text that is hidden behind the zoom icon on the results page.
The Python code I use to implement this is the following:
import...
driver = webdriver.Safari()
driver.maximize_window()
driver.get("https://neu.insolvenzbekanntmachungen.de/ap/suche.jsf")
date_input = WebDriverWait(driver, 10).until(
    ec.visibility_of_element_located((By.ID, "frm_suche:ldi_datumVon:datumHtml5"))
)
curr_date = datetime.today().strftime("%Y-%m-%d")
arg_value = f"arguments[0].value = '{curr_date}';"
driver.execute_script("arguments[0].value = '';", date_input)
driver.execute_script(arg_value, date_input)
# Select the "Eröffnungen" option from the dropdown menu
select_element = WebDriverWait(driver, 20).until(
    ec.visibility_of_element_located((By.ID, "frm_suche:lsom_gegenstand:lsom"))
)
driver.execute_script("arguments[0].scrollIntoView(true);", select_element)
time.sleep(1)
select = Select(select_element)
select.select_by_value("2")
# Execute the search
driver.execute_script("arguments[0].dispatchEvent(new Event('change'));", date_input)
date_input.send_keys(Keys.RETURN)
results_table = WebDriverWait(driver, 20).until(
    ec.visibility_of_element_located((By.ID, "tbl_ergebnis"))
)
rows = results_table.find_elements(By.TAG_NAME, "tr")
data = []
# Iterate over each row of the table
for i, row in enumerate(rows):
    cells = row.find_elements(By.TAG_NAME, "td")
    if len(cells) > 0:
        # Click the zoom icon to open the new window
        zoom_icon = cells[6].find_element(By.TAG_NAME, "input[type='image']")
        WebDriverWait(driver, 20).until(
            ec.element_to_be_clickable((By.TAG_NAME, "input[type='image']"))
        )
        driver.execute_script("arguments[0].scrollIntoView(true);", zoom_icon)
        time.sleep(1)  # Wait for scroll
        driver.execute_script("arguments[0].click();", zoom_icon)
        # Wait for the new window to open
        WebDriverWait(driver, 20).until(ec.new_window_is_opened)
        driver.switch_to.window(driver.window_handles[1])
        # Extract the publication text
        print("here")
        WebDriverWait(driver, 20).until(
            ec.presence_of_element_located((By.XPATH,
                "//form[@id='form']//pre[@id='veroefftext']"))
        )
        print("here2")
        pub_text = driver.find_element(By.XPATH, "//form[@id='form']//pre[@id='veroefftext']").text
        # Close the new window and switch to the first window again
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        # Store the extracted data
        data.append({
            'veroeffentlichungsdatum': cells[0].text,
            'aktenzeichen': cells[1].text,
            'gericht': cells[2].text,
            'name_vorname_bezeichnung': cells[3].text,
            'sitz_wohnsitz': cells[4].text,
            'register': cells[5].text,
            'veroeffentlichungstext': pub_text
        })
df = pd.DataFrame(data)
df.columns = ['veroeffentlichungsdatum', 'aktuelles_aktenzeichen', 'gericht',
              'name_vorname_bezeichnung', 'sitz_wohnsitz', 'register', 'veroeffentlichungstext']
df.to_excel("data/insos.xlsx", index=False)
I'm relatively new to Selenium, but I think the code itself works well and does what it's supposed to. However, in the part between the two print statements (print("here") and print("here2")), the element often cannot be located and I get a TimeoutException. This happens seemingly at random: sometimes it works and sometimes it doesn't. I have also had a run in which I was able to extract the text behind all zoom icons.
Could it be that the identifiers are not unique? I've read related posts, and most of them suggest using XPath, which I have already implemented. I also tried calling driver.switch_to.default_content() after the print("here") statement, as suggested in a related post, but the problem remains.
Any help on how to make this stable is much appreciated, as the script is supposed to run every day.
Upvotes: 0
Views: 54
Reputation: 250
Try the following.
Upvotes: 0
Reputation: 124
You explicitly wait for the element to be located, and you said yourself that sometimes it works and sometimes it does not. You wait for 20 seconds (the WebDriverWait timeout is given in seconds).
Based on what you described, it looks like the element sometimes appears within that time and sometimes doesn't. This means you have to tune the wait timeout until you find a value that reliably gives the element enough time to appear.
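Besides the timeout itself, it may be worth checking that the script is really looking at the new window when it starts waiting. As far as I can tell, Selenium's ec.new_window_is_opened expects the list of handles that existed before the click as its argument; in the question it is passed to until without that argument, so that wait may return immediately. Below is a minimal sketch of the same idea written as plain polling. The helper name switch_to_new_window and its parameters are my own invention, not part of Selenium; it only relies on driver.window_handles and driver.switch_to.window.

```python
import time


def switch_to_new_window(driver, handles_before, timeout=20.0, poll=0.2):
    """Hypothetical helper: wait until a window handle appears that was not
    in handles_before, switch to it, and return that handle.

    Raises TimeoutError if no new window shows up within `timeout` seconds.
    """
    known = set(handles_before)
    deadline = time.monotonic() + timeout
    while True:
        # Compare against the handles recorded *before* the click instead of
        # assuming the new window is always at index 1.
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            driver.switch_to.window(new_handles[0])
            return new_handles[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new window appeared within {timeout} s")
        time.sleep(poll)
```

In the loop from the question this would be used roughly as follows: store handles_before = driver.window_handles right before the click on the zoom icon, call switch_to_new_window(driver, handles_before) after the click, and only then wait for the pre element with id veroefftext. Whether this removes the random TimeoutException on that site is an assumption I have not verified.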
Upvotes: 0