Newbe

Reputation: 127

Python Selenium extract a href based on id value

I am trying to extract href values based on unique ids, where the digits after the "p" vary but are all numerals and terminate in a closing double quote (").

For example: id="p4423234" id="p5547" id="p4124234" id="234"

<a href="/string/string-string.html" class="profile-enable" rel="nofollow" id="p1234">

I can grep the value of p using

cat p_id.html | grep "id=\"p[0-9]\+\""

But I am unable to figure out how to return the href value with find_element_by_id in Python Selenium.

Thank you in advance for your help. I am new to web scraping but enjoying the challenge.

Upvotes: 1

Views: 407

Answers (4)

Sayed Zainul Abideen

Reputation: 764

Extending Avinash Raj's answer:

import re
from bs4 import BeautifulSoup
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("http://example.com")

html = '''<a href="/string/string-string.html" class="profile-enable" rel="nofollow"  id="p154234">
         <a href="/string/string-foo.html" class="profile-enable" rel="nofollow"  id="p1235">
         <a href="/string/stricccng-bar.html" class="profile-enable" rel="nofollow"  id="12555">
'''

# or, with a live Selenium session:
# html = driver.page_source

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

# The optional "p" covers all cases: id="p4423234" id="p5547" id="p4124234" id="234"
links = soup.find_all('a', attrs={'id': re.compile(r'^p?\d+$')})
for link in links:
    print(link['href'])

Upvotes: 1

fndg87

Reputation: 331

Locate the element with an XPath expression that matches on its id, then read the href from that element with get_attribute("href"), and boom!
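A minimal sketch of that idea, assuming Selenium 4 and a `driver` that has already loaded the page. `hrefs_by_xpath` is a hypothetical helper name, not part of Selenium, and the XPath in the usage comment is only one way to match the ids from the question.

```python
def hrefs_by_xpath(driver, xpath):
    """Return the href attribute of every element matched by `xpath`."""
    # "xpath" is the string value of selenium.webdriver.common.by.By.XPATH;
    # the plain string keeps this sketch importable without Selenium installed.
    elements = driver.find_elements("xpath", xpath)
    return [element.get_attribute("href") for element in elements]


# Usage with a live session (assumption: Firefox and its driver are installed):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("http://example.com")
# print(hrefs_by_xpath(driver, "//a[@id='p1234']"))
# print(hrefs_by_xpath(driver, "//a[starts-with(@id, 'p')]"))
```

Selenium generally reports href as a resolved absolute URL, so the values may not be character-for-character identical to the relative paths in the page source.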

Upvotes: 0

Florent B.

Reputation: 42518

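To return all the elements with an id like "p[0-9]+" (Selenium 4 syntax; older releases spelled this find_elements_by_xpath):

from selenium.webdriver.common.by import By
driver.find_elements(By.XPATH, "//*[starts-with(@id,'p') and substring(@id,2)>=0]")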
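The XPath above converts the text after the "p" to a number before comparing it with 0, so an id such as "p1.5" would also pass. If that matters, the matched elements can be filtered again in Python. The following is a sketch under that assumption; `hrefs_for_p_ids` and `P_ID` are hypothetical names, and `elements` is whatever list the driver call returned.

```python
import re

# "p" followed by one or more ASCII digits and nothing else
P_ID = re.compile(r"p[0-9]+")


def hrefs_for_p_ids(elements):
    """Return the href of each element whose id is exactly 'p' plus digits."""
    hrefs = []
    for element in elements:
        element_id = element.get_attribute("id") or ""
        if P_ID.fullmatch(element_id):
            hrefs.append(element.get_attribute("href"))
    return hrefs


# Usage with a live session (assumes `driver` already exists):
# elements = driver.find_elements("xpath", "//*[starts-with(@id,'p')]")
# print(hrefs_for_p_ids(elements))
```

Doing the strict check in Python keeps the XPath short and makes the accepted id format easy to change in one place.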

Upvotes: 1

Avinash Raj

Reputation: 174696

You may use regex in BeautifulSoup for selecting a particular tag.

>>> import re
>>> from bs4 import BeautifulSoup
>>> html = '''<a href="/string/string-string.html" class="profile-enable" rel="nofollow" 
... id="p1234"> <a href="/string/string-foo.html" class="profile-enable" rel="nofollow" 
... id="p1235"> '''
>>> soup = BeautifulSoup(html, "html.parser")
>>> [i['href'] for i in soup.find_all('a', attrs={'id': re.compile(r'^p\d+$')})]
['/string/string-string.html', '/string/string-foo.html']

or

>>> [i['href'] for i in soup.find_all(attrs={'id': re.compile(r'^p\d+$')}, href=True)]
['/string/string-string.html', '/string/string-foo.html']

Upvotes: 0
