Reputation: 127
I am trying to extract href values based on unique ids, where the digits after the p vary but are always numerals and run up to the closing double quote,
for example id="p4423234" id="p5547" id="p4124234" id="234"
<a href="/string/string-string.html" class="profile-enable" rel="nofollow"
id="p1234">
I can grep the value of p using
cat p_id.html | grep "id=\"p[0-9]\+\""
But I cannot figure out how to return the href value with find_element_by_id in Python Selenium.
Thank you in advance for your help. I am new to web scraping but enjoying the challenge.
Upvotes: 1
Views: 407
Reputation: 764
Extending Avinash Raj's answer:
`
import re
from bs4 import BeautifulSoup
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("http://example.com")
html = '''<a href="/string/string-string.html" class="profile-enable" rel="nofollow" id="p154234">
<a href="/string/string-foo.html" class="profile-enable" rel="nofollow" id="p1235">
<a href="/string/stricccng-bar.html" class="profile-enable" rel="nofollow" id="12555">
'''
# or
# html = driver.page_source
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
# The optional "p" covers all cases: id="p4423234" id="p5547" id="p4124234" id="234"
a = soup.find_all('a', attrs={'id': re.compile(r'^p?\d+$')})
for i in a:
    print(i['href'])
`
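To see what the optional `p` changes, here is a small sketch that runs both patterns from this thread, the lenient `^p?\d+$` used here and the stricter `^p\d+$` from Avinash Raj's answer, against the ids from the question. The last two ids (`profile`, `p12a`) are hypothetical non-matches added for illustration only.

```python
import re

# Sample ids: the four from the question plus two hypothetical non-matches.
ids = ["p4423234", "p5547", "p4124234", "234", "profile", "p12a"]

strict = re.compile(r"^p\d+$")    # the leading "p" is required
lenient = re.compile(r"^p?\d+$")  # the leading "p" is optional

strict_matches = [i for i in ids if strict.search(i)]
lenient_matches = [i for i in ids if lenient.search(i)]

print(strict_matches)   # ['p4423234', 'p5547', 'p4124234']
print(lenient_matches)  # ['p4423234', 'p5547', 'p4124234', '234']
```

If ids without the `p` prefix should be ignored, the stricter pattern is probably the safer choice.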
Upvotes: 1
Reputation: 331
Locate the element with an XPath expression that matches on its id, then read the href attribute from the element you get back.
Upvotes: 0
Reputation: 42518
To return all the elements whose id looks like "p[0-9]+":
driver.find_elements_by_xpath("//*[starts-with(@id,'p') and substring(@id,2)>=0]")
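The XPath above returns elements, not href values, so one more step is needed. The sketch below is one way to do it, not the answerer's code: `hrefs_for_p_ids` and `P_ID_XPATH` are hypothetical names, and it assumes a Selenium 4 style driver, where `find_elements(by, value)` replaces the older `find_elements_by_xpath` helper. The XPath works because `substring(@id,2)` is converted to a number for the comparison, and a non-numeric remainder becomes NaN, which fails `>=0`.

```python
# XPath from the answer above: the id starts with "p" and the rest of the id
# converts to a non-negative number.
P_ID_XPATH = "//*[starts-with(@id,'p') and substring(@id,2)>=0]"


def hrefs_for_p_ids(driver):
    """Return the href of every element matched by P_ID_XPATH that has one.

    `driver` is assumed to be a Selenium 4 WebDriver (hypothetical helper).
    """
    # "xpath" is the string value behind Selenium's By.XPATH constant.
    elements = driver.find_elements("xpath", P_ID_XPATH)
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:  # skip matched elements that carry no href
            hrefs.append(href)
    return hrefs


# Usage (needs a real browser session, so it is left commented out):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("http://example.com")
# print(hrefs_for_p_ids(driver))
```

Note that `get_attribute("href")` typically returns the URL as resolved by the browser, so a relative link such as `/string/string-string.html` may come back as an absolute URL.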
Upvotes: 1
Reputation: 174696
You can pass a compiled regex to BeautifulSoup to select tags by their attribute value.
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = '''<a href="/string/string-string.html" class="profile-enable" rel="nofollow"
id="p1234"> <a href="/string/string-foo.html" class="profile-enable" rel="nofollow"
id="p1235"> '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> [i['href'] for i in soup.find_all('a', attrs={'id': re.compile(r'^p\d+$')})]
['/string/string-string.html', '/string/string-foo.html']
or
>>> [i['href'] for i in soup.find_all(attrs={'id': re.compile(r'^p\d+$')}, href=True)]
['/string/string-string.html', '/string/string-foo.html']
Upvotes: 0