mikelowry

Reputation: 1727

Can't get entire link from string with regex in python

I have the following string, and I want to parse out the link.

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

So essentially, I want to grab everything between 'href="' and '">'.

The result should be: /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml

This is what I've tried:

test = re.search('(?<=href).*?(?=.xml)', final_link_str)

And for kicks and giggles I tried this as well, to grab everything after href:

test = re.search('(?<=href).*', final_link_str)

No matter what I do, the output is only a part of the entire link.

Here is the result I'm getting:

<re.Match object; span=(23, 163), match='="/Archives/edgar/data/886982/000076999319000460/>

Upvotes: 1

Views: 58

Answers (3)

CertainPerformance

Reputation: 370889

Consider parsing the HTML using BeautifulSoup instead:

from bs4 import BeautifulSoup

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
soup = BeautifulSoup(string, 'html.parser')
href = soup.find('a')['href']

Result:

/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml
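
If installing bs4 is not an option, the standard library's html.parser module can probably do the same job with a bit more code. This is only a minimal sketch: HrefCollector is a hypothetical helper name made up for illustration, not part of any library.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

collector = HrefCollector()
collector.feed(string)
collector.close()

# Print the first link found in the fragment.
print(collector.hrefs[0])
```

Collecting into a list means the same class also works when a row contains more than one link.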

Upvotes: 4

dcg

Reputation: 4219

This gets the value of the href:

>>> import re
>>> string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
>>> re.search('href="(.*?)"', string).groups(0)
('/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml',)
>>> 

EDIT: As commented by @Jonas Berlin, the correct output would be the string itself rather than a one-element tuple:

>>> v, = re.search('href="(.*?)"', string).groups(0)
>>> v        
'/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml'
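
A side note on the output shown in the question: in CPython, the repr of a re.Match object shortens the matched text for display, so the printed object can look cut off even when the match itself is complete. Here is a minimal sketch, reusing the string from above, that reads the text with .group() instead of relying on the repr. The lookaround pattern is my own variation on the question's attempt, not the asker's exact code.

```python
import re

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

# Lookbehind for href=" and lookahead for the closing quote,
# so the match itself is only the attribute value.
match = re.search(r'(?<=href=").*?(?=")', string)

print(repr(match))    # display form; the match text is shortened here
print(match.group())  # the full matched text

# Equivalent with a capturing group: group(1) returns the string directly.
href = re.search(r'href="(.*?)"', string).group(1)
print(href)
```

Using .group(1) also avoids the tuple unpacking needed with .groups().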

Upvotes: 0

Emma

Reputation: 27723

In case there are undesired spaces before and after the link inside the quotes, this expression might work:

href="\s*([^"\s]*)\s*"

Test

import re

string = """
<td scope="row"><a href=" /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml ">InfoTable_2019-08-09_Final.html</a></td>None
"""

expression = r'href="\s*([^"\s]*)\s*"'
matches = re.findall(expression, string)

print(matches)

Output

['/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml']
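
To get a feel for what the expression accepts, here is a small sketch that runs it against a few made-up inputs. The sample strings and the names HREF_PATTERN, samples and results are hypothetical, chosen only for illustration.

```python
import re

# Same expression as above, compiled once for reuse.
HREF_PATTERN = re.compile(r'href="\s*([^"\s]*)\s*"')

# Made-up inputs: padded value, unpadded value, and a value with a space inside.
samples = {
    "padded": '<a href=" /a/b.xml ">x</a>',
    "unpadded": '<a href="/a/b.xml">x</a>',
    "inner space": '<a href="/a b.xml">x</a>',
}

results = {label: HREF_PATTERN.findall(html) for label, html in samples.items()}

for label, found in results.items():
    print(label, found)
```

The padded and unpadded inputs both yield the bare path. The value with a space in the middle yields no match, because the capture group excludes whitespace, so that case would need a different expression.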

If you wish to explore, simplify, or modify the expression, regex101.com explains it in its top right panel. You can also see there how it would match against some sample inputs.


Upvotes: 0
