Reputation: 1170
I have a dataset of links to newspaper articles that I want to do some research on. However, the links in the dataset end with the .ece extension, which is a problem for me because of some API restrictions. For example,
http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece
and
http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html
are links to the same page. Now I need to convert all the .ece links into .html links. The only way I have found is to parse the page and look up the original .html link. The problem is that the link is buried inside an HTML meta element, and I can't get to it using tree.xpath.
<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html"
Unfortunately, I am not well acquainted with regex and don't know how to extract a link with it. Basically, every link I need starts with:
<meta content="http://www.telegraaf.nl/
I need the full link (i.e., http://www.telegraaf.nl/THE_REST_OF_THE_LINK). Also, I'm using BeautifulSoup for parsing. Thanks.
Upvotes: 0
Views: 162
Reputation: 15755
Here is a really simple regex to get you started.
This one will extract all links (the [^"]* stops at the closing quote, so the match cannot run into later attributes):
<meta content="(http://www\.telegraaf\.nl[^"]*)"
This one will match only the .html links:
<meta content="(http://www\.telegraaf\.nl[^"]*\.html)"
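To see how the two patterns differ, here is a minimal sketch for Python 3 that runs both against a small snippet containing an .ece tag and an .html tag. The names ANY_LINK, HTML_LINK and sample are made up for the illustration, and [^"]* is used so a match stops at the closing quote.

```python
import re

# Hypothetical names; the patterns follow the two regexes described above.
ANY_LINK = re.compile(r'<meta content="(http://www\.telegraaf\.nl[^"]*)"')
HTML_LINK = re.compile(r'<meta content="(http://www\.telegraaf\.nl[^"]*\.html)"')

# Illustrative input: one .ece tag and one .html tag.
sample = (
    '<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/'
    'article22178882.ece" />\n'
    '<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/'
    '22178882/__Wenger_vreest_het_ergste__.html" />\n'
)

all_links = ANY_LINK.findall(sample)    # every telegraaf.nl link in a meta tag
html_links = HTML_LINK.findall(sample)  # only the links ending in .html

print(len(all_links), len(html_links))
```

The first pattern picks up both tags, while the second keeps only the one ending in .html.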
To use this with what you have, you can do the following:
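import urllib2  # Python 2; on Python 3 use urllib.request and decode the response bytes
import re

# ece_url_list is assumed to be your list of .ece URLs
replacements = dict()
for url in ece_url_list:
    response = urllib2.urlopen(url)
    html = response.read()
    # keep the first .html link found in a meta tag
    replacements[url] = re.findall(r'<meta content="(http://www\.telegraaf\.nl[^"]*\.html)"', html)[0]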
Note: This assumes that every page includes an .html link in this meta tag. The code takes the first match and raises an IndexError if there is none.
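If some pages might lack the tag, one option is to record those URLs instead of failing. A rough sketch for Python 3, assuming the page sources have already been downloaded into a dict; extract_html_link, build_replacements and the pages argument are hypothetical names, not part of any library.

```python
import re

META_HTML_LINK = re.compile(r'<meta content="(http://www\.telegraaf\.nl[^"]*\.html)"')


def extract_html_link(page_source):
    """Return the first .html link found in a meta tag, or None if absent."""
    match = META_HTML_LINK.search(page_source)
    return match.group(1) if match else None


def build_replacements(pages):
    """Map each .ece URL to its .html link.

    `pages` is assumed to map an .ece URL to the page source fetched for it.
    Returns the mapping plus a list of URLs where no link was found.
    """
    replacements, missing = {}, []
    for url, source in pages.items():
        link = extract_html_link(source)
        if link is None:
            missing.append(url)
        else:
            replacements[url] = link
    return replacements, missing
```

The missing list can then be inspected by hand or retried later.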
Upvotes: 1
Reputation: 142176
Use BeautifulSoup to find the matching content attributes, then replace them like so:
from bs4 import BeautifulSoup
import re

html = """
<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece" />
<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html" />
"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# reference table of url prefixes to full html link
html_links = {
    el['content'].rpartition('/')[0]: el['content']
    for el in soup.find_all('meta', content=re.compile(r'\.html$'))
}

# find all ece links, strip the end off to match the prefixes, then
# replace the meta content with the looked-up html link
for el in soup.find_all('meta', content=re.compile(r'\.ece$')):
    url = re.sub(r'article(\d+)\.ece$', r'\1', el['content'])
    el['content'] = html_links[url]

# every meta tag now carries the .html link
print(soup)
# <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html"/>
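If you only need the .html URL and not a rewritten document, the lookup can also be done while parsing. This is a sketch for Python 3 that swaps in the standard library's html.parser instead of BeautifulSoup, purely so the example has no third-party dependency; MetaHtmlLinkFinder and find_html_links are hypothetical names.

```python
from html.parser import HTMLParser


class MetaHtmlLinkFinder(HTMLParser):
    """Collect content attributes of <meta> tags that point to .html pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        content = dict(attrs).get("content") or ""
        if content.startswith("http://www.telegraaf.nl/") and content.endswith(".html"):
            self.links.append(content)


def find_html_links(page_source):
    """Return all telegraaf.nl .html links found in meta tags."""
    finder = MetaHtmlLinkFinder()
    finder.feed(page_source)
    return finder.links
```

Calling find_html_links on a downloaded page returns a list, so an empty result tells you the tag was missing.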
Upvotes: 1
Reputation: 67968
(.*?)(http:\/\/.*\/.*?\.)(ece)
Try this. Replace the match with $2html.
See the demo:
http://regex101.com/r/nA6hN9/24
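In Python the same substitution could look roughly like this (a sketch; ECE_LINK and url are illustrative names). Python writes the group reference as \g<2> rather than $2. Note that this only swaps the extension, so the result keeps the "article22178882" form rather than the slug URL from the question, and whether the site serves that address is not established here.

```python
import re

# Same pattern as above, without the unneeded escaping of "/".
ECE_LINK = re.compile(r'(.*?)(http://.*/.*?\.)(ece)')

url = "http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece"

# Keep group 2 (everything up to and including the final dot) and append "html".
converted = ECE_LINK.sub(r'\g<2>html', url)
print(converted)
```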
Upvotes: 0