Qashin

Reputation: 125

How to extract partial text from href using BeautifulSoup in Python

Here's my code:

for item in data:
    print(item.find_all('td')[2].find('a'))
    print(item.find('span').text.strip())
    print(item.find_all('td')[3].text)
    print(item.find_all('td')[2].find(target="_blank").string.strip())

It prints the following:

<a href="argument_transcripts/2016/16-399_3f14.pdf" 
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile" 
target="_blank">16-399. </a>

Perry v. Merit Systems Protection Bd.

04/17/17

16-399.

All I want from the href attribute is this part: 16-399_3f14

How can I do that? Thanks.

Upvotes: 1

Views: 1864

Answers (1)

Joe.Ingalls

Reputation: 196

You can use find_all to select the anchor elements that have an href attribute, then split each href value to get the part you are looking for.

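from bs4 import BeautifulSoup  # BeautifulSoup 4, which provides find_all

html = '''<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# href=True matches only anchors that actually carry an href attribute
for a in soup.find_all('a', href=True):
    url = a['href'].split('/')
    print(url[-1])  # last path segment, i.e. the file name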

This prints the last path segment of the href, which is the file name including its .pdf extension:

16-399_3f14.pdf
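To get exactly 16-399_3f14, as the question asks, the extension still has to be removed. A minimal sketch of one way to do that with the standard library follows. The helper name href_stem is hypothetical, and it assumes the href is a plain relative path with no query string or fragment.

```python
import posixpath


def href_stem(href):
    """Return the last path segment of href without its file extension.

    Hypothetical helper for illustration; assumes href is a plain
    '/'-separated path with no query string or fragment.
    """
    filename = posixpath.basename(href)       # '16-399_3f14.pdf'
    stem, _ext = posixpath.splitext(filename)  # ('16-399_3f14', '.pdf')
    return stem


print(href_stem("argument_transcripts/2016/16-399_3f14.pdf"))
```

Inside the loop above, this would be called as href_stem(a['href']). posixpath is used rather than os.path so that '/' is treated as the separator on every platform.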

Upvotes: 1
