Reputation: 1727
I have the following string, and I want to parse out the link.
string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
So essentially, I want to grab everything between 'href="' and '">'.
The result should be: /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml
This is what I've tried:
test = re.search('(?<=href).*?(?=.xml)', final_link_str)
and for kicks and giggles I tried this as well, to grab everything after href:
test = re.search('(?<=href).*', final_link_str)
No matter what I do, the output is only a part of the entire link.
Here is the result I'm getting:
<re.Match object; span=(23, 163), match='="/Archives/edgar/data/886982/000076999319000460/>
Upvotes: 1
Views: 58
Reputation: 370889
Consider parsing the HTML using BeautifulSoup instead:
from bs4 import BeautifulSoup
string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
soup = BeautifulSoup(string, 'html.parser')
href = soup.find('a')['href']
Result:
/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml
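If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the names HrefCollector and collect_hrefs are made up for illustration, and it assumes you want the href of every <a> tag in the fragment.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(html):
    """Return all href values found in <a> tags, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
print(collect_hrefs(string))
```

Like the BeautifulSoup version, this relies on a real HTML parser rather than on the exact spelling of the attribute, so it should keep working if the quoting or attribute order changes.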
Upvotes: 4
Reputation: 4219
This gets the value of the href attribute:
>>> import re
>>> string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
>>> re.search('href="(.*?)"', string).groups(0)
('/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml',)
EDIT: As @Jonas Berlin pointed out in the comments, groups() returns a tuple. To get the string itself, unpack it:
>>> v, = re.search('href="(.*?)"', string).groups(0)
>>> v
'/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml'
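It may also help to see why the attempt in the question looked cut off. The repr of a Match object abbreviates the matched text, so the printed object shows only the first part of the match even though the match itself runs to the end of the string; calling .group() returns the full text. The sketch below illustrates that, and wraps the capture-group approach in a small function (extract_href is a name made up here) that returns None when there is no match instead of raising an AttributeError.

```python
import re

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'


def extract_href(html):
    """Hypothetical helper: return the first double-quoted href value, or None."""
    match = re.search(r'href="(.*?)"', html)
    # group(1) is the text captured by the first pair of parentheses
    return match.group(1) if match else None


# The greedy lookbehind pattern from the question
greedy = re.search(r'(?<=href).*', string)
shown = repr(greedy)        # abbreviated, this is what print(greedy) displays
full_text = greedy.group()  # the complete matched text

print(shown)
print(full_text)
print(extract_href(string))
```

Note that the greedy pattern really does match everything after href, including the closing tags, which is why a capture group that stops at the next double quote is the more useful pattern here.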
Upvotes: 0
Reputation: 27723
If there might be unwanted whitespace inside the quotes, before or after the URL, the following expression may work:
href="\s*([^"\s]*)\s*"
import re
string = """
<td scope="row"><a href=" /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml ">InfoTable_2019-08-09_Final.html</a></td>None
"""
expression = r'href="\s*([^"\s]*)\s*"'
matches = re.findall(expression, string)
print(matches)
['/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml']
If you wish to explore, simplify, or modify the expression, regex101.com explains each part of it in its top-right panel, and you can also check there how it matches against sample inputs.
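As an offline alternative, the same expression can be annotated piece by piece with re.VERBOSE. This is just a sketch; HREF_PATTERN is a name chosen here for illustration, and the sample inputs are invented.

```python
import re

# Same expression as above, written with re.VERBOSE so each part can carry a comment
HREF_PATTERN = re.compile(r'''
    href="      # the attribute name, an equals sign, and the opening quote
    \s*         # optional whitespace just inside the opening quote
    ([^"\s]*)   # group 1: a run of characters that are neither quotes nor whitespace
    \s*         # optional whitespace just before the closing quote
    "           # the closing quote
''', re.VERBOSE)

padded = '<a href="  /Archives/edgar/data/example.xml ">link</a>'
print(HREF_PATTERN.findall(padded))

# A value with whitespace in the middle does not match at all,
# because group 1 cannot contain whitespace
print(HREF_PATTERN.findall('<a href="two words">link</a>'))
```

One consequence of excluding whitespace from the capture group is that URLs containing unencoded spaces are skipped entirely rather than truncated, which may or may not be what you want.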
Upvotes: 0