Reputation: 273
I am trying to download a list of xls files from a URL using urllib.urlretrieve (Python 2.7). I am able to get the files, but each one has a <script> tag at the top that makes it unreadable in Excel.
Here is what I have:
import urllib
files = ['a', 'b', 'c', 'd', 'e', 'f']
url = 'http://www.thewebsite.com/data/dl_xls.php?bid='
for f in files:
    urllib.urlretrieve(url + f, f + '.xls')
This downloads an xls file with the following at the top:
<script>parent.parent.location.href = '../../../../a';</script>
which makes it unreadable in Excel.
If I remove that script tag from the xls, the file opens correctly in Excel.
EDIT - Here is my solution from pypypy:
import re
import urllib
files = ['a', 'b', 'c', 'd', 'e', 'f']
url = 'http://www.thewebsite.com/data/dl_xls.php?bid='
for f in files:
    input_xls = f + '_in.xls'
    urllib.urlretrieve(url + f, input_xls)
    # The with block closes both files, so no explicit close() calls are needed.
    with open(input_xls, "rb") as i, open(f + '_out.xls', "wb") as output:
        # Pass flags by keyword: the fourth positional argument of re.sub is count.
        output.write(re.sub('<script>.*</script>', "", i.read(), flags=re.I))
Upvotes: 0
Views: 787
Reputation: 1105
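Try building a regex that matches the script tag and use it to remove the tag, e.g.:
import re
# Pass flags by keyword: the fourth positional argument of re.sub is count, not flags.
re.sub('<script>.*</script>', "", content, flags=re.I)
This returns a copy of content with the script tag replaced by "".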
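A slightly more defensive variant of the same idea, as a rough sketch: it assumes the file is read in binary mode (so the pattern is a bytes pattern), that the script tag may carry attributes or span several lines, and that there may be more than one tag. The names SCRIPT_RE and strip_script_tags are made up for this example and are not part of any library.

```python
import re

# Hypothetical helper, not from the original posts.
# re.S lets '.' match newlines, '.*?' stops at the first closing tag,
# and [^>]* tolerates attributes such as <script type="...">.
SCRIPT_RE = re.compile(br'<script\b[^>]*>.*?</script\s*>', re.I | re.S)

def strip_script_tags(data):
    """Return the bytes in data with every <script>...</script> block removed."""
    return SCRIPT_RE.sub(b'', data)

sample = b"<script>parent.parent.location.href = '../../../../a';</script>rest of file"
cleaned = strip_script_tags(sample)
```

In the download loop, this would be applied to the bytes read from the downloaded file before writing them back out. The non-greedy '.*?' matters when a file contains more than one script tag, since a greedy '.*' would also delete everything between the first opening tag and the last closing tag.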
Upvotes: 1