Reputation: 2952
I'd like to remove all internal links from a bunch of .html files. The basic idea is that anything starting with <a href= is a link, and if that does not start with <a href="http it's an internal link.
I'm trying to write a tiny Python script to accomplish this. The first half of each file gets done perfectly, but the script consistently crashes on the same link. I checked for typos and missing </a>'s but I don't see any. If I rerun the script, the "problem link" gets removed but its </a> stays in. More and more links seem to get removed each time I rerun the script, but I'd like all internal links to be chopped out in one run.
Does anybody have a suggestion as to what I'm doing wrong? Please see below for the code I'm using.
tList = [r"D:\@work\projects_2013\@websites\pythonforspss\a44\@select-variables-having-pattern-in-names.html"]
for path in tList:
    readFil = open(path,"r")
    writeFil = open(path[:path.rfind("\\") +1] + "@" + path[path.rfind("\\") + 1:],"w")
    flag = 0
    for line in readFil:
        for ind in range(len(line)):
            if flag == 0:
                try:
                    if line[ind:ind + 8].lower() == '<a href=' and line[ind:ind + 13].lower() != '<a href="http':
                        flag = 1
                        sLine = line[ind:]
                        link = sLine[:sLine.find(">") + 1]
                        line = line.replace(link,"")
                        print link
                except:
                    pass
            if flag == 1:
                try:
                    if line[ind:ind + 4].lower() == '</a>':
                        flag = 0
                        line = line.replace('</a>',"")
                        print "</a>"
                except:
                    pass
        writeFil.write(line)
    readFil.close()
    writeFil.close()
Upvotes: 0
Views: 1626
Reputation: 23
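# Assumed imports: the snippet uses the aliases req and BS without defining them
import re
import requests as req
from bs4 import BeautifulSoup as BS
query = input('Enter the word to be searched:')
url = 'https://google.com/search?q=' + query
request_result = req.get(url).text
soup = BS(request_result, 'lxml')
# Keep only the anchors whose href contains "https://"
for link in soup.find_all('a', href=re.compile("https://")):
    print(link['href'].replace("/url?q=", ""))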
I used the code above with Beautiful Soup and it successfully returned only the https links.
I tried the solution posted above, but it did not work for me; in fact, the number of links I got back dropped sharply after using it.
Hope this helps!
Upvotes: 0
Reputation: 880587
Use an HTML parser like BeautifulSoup or lxml. Using lxml, you might do something like this:
import lxml.html as LH
url = 'http://stackoverflow.com/q/15186769/190597'
doc = LH.parse(url)
# Save a copy of the original just to compare with the altered version, below
# (LH.tostring returns bytes on Python 3, hence the binary mode)
with open('/tmp/orig.html', 'wb') as f:
    f.write(LH.tostring(doc))
# Remove every <a> element whose href does not start with "http"
for atag in doc.xpath('//a[not(starts-with(@href,"http"))]'):
    parent = atag.getparent()
    parent.remove(atag)
with open('/tmp/altered.html', 'wb') as f:
    f.write(LH.tostring(doc))
The equivalent in BeautifulSoup looks like this:
import bs4 as bs
# urllib2 is Python 2 only; on Python 3 the equivalent call is urllib.request.urlopen
import urllib2
url = 'http://stackoverflow.com/q/15186769/190597'
soup = bs.BeautifulSoup(urllib2.urlopen(url))
with open('/tmp/orig.html', 'w') as f:
    f.write(str(soup))
# Pull every <a> with an href that does not start with "http" out of the tree
for atag in soup.find_all('a', {'href':True}):
    if not atag['href'].startswith('http'):
        atag.extract()
with open('/tmp/altered.html', 'w') as f:
    f.write(str(soup))
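Note that both versions above delete the whole <a> element, link text included. If you only want to drop the <a ...> and </a> tags and keep the text between them, as the script in the question tries to do, here is a rough sketch of the same parser-based idea. It swaps in html.parser from the Python 3 standard library in place of lxml or BeautifulSoup, so nothing needs installing. The names InternalLinkStripper and strip_internal_links are made up for this example, and it keeps the question's rule that an href starting with "http" is external.

```python
from html.parser import HTMLParser


class InternalLinkStripper(HTMLParser):
    """Re-emit the HTML as parsed, minus <a> tags whose href is not http(s).

    Hypothetical helper for illustration; the link text itself is kept.
    """

    def __init__(self):
        # convert_charrefs=False leaves entities such as &amp; untouched
        super().__init__(convert_charrefs=False)
        self.out = []
        # One entry per currently open <a>: True if its start tag was dropped
        self.open_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            internal = href is not None and not href.lower().startswith("http")
            self.open_links.append(internal)
            if internal:
                return  # drop the opening tag, keep whatever follows it
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through as written
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links and self.open_links.pop():
            return  # closing tag of a dropped internal link
        # The parser lowercases tag names, so end tags come out lowercase
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_internal_links(html):
    """Return html with internal <a> tags removed and their text kept."""
    parser = InternalLinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<p>See <a href="page2.html">this page</a> or '
          '<a href="http://example.com">that site</a>.</p>')
print(strip_internal_links(sample))
# prints: <p>See this page or <a href="http://example.com">that site</a>.</p>
```

To use it on the files from the question, read each file into a string, pass it through strip_internal_links, and write the result to the new path. Because the parser tracks open and close tags itself, several links on one line or a link split across lines are handled in a single run.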
Upvotes: 1