its30

Reputation: 273

urllib download excel file from php link

I am trying to download a list of xls files from a URL using urllib.urlretrieve (Python 2.7). I am able to get each file, but there is a <script> tag at the top of the file that makes it unreadable in Excel.

Here is what I have:

import urllib

files = ['a', 'b', 'c', 'd', 'e', 'f']

url = 'http://www.thewebsite.com/data/dl_xls.php?bid='

for f in files:
    urllib.urlretrieve(url + f, f + '.xls')

This downloads an xls file with the following at the top: <script>parent.parent.location.href = '../../../../a';</script>, which makes it unreadable in Excel.

If I remove that script tag from the xls, the file opens correctly in Excel.

EDIT - Here is my solution, based on pypypy's answer:

import re
import urllib

files = ['a', 'b', 'c', 'd', 'e', 'f']

url = 'http://www.thewebsite.com/data/dl_xls.php?bid='

for f in files:
    input_xls = f + '_in.xls'
    urllib.urlretrieve(url + f, input_xls)
    # Read the raw download, strip the script tag, and write a cleaned copy.
    with open(input_xls, "rb") as src, open(f + '_out.xls', "wb") as dst:
        # re.I must be passed as flags=; the fourth positional argument is count.
        dst.write(re.sub('<script>.*?</script>', "", src.read(), flags=re.I))

Upvotes: 0

Views: 787

Answers (1)

pypypy

Reputation: 1105

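Try building a regex that matches the script tag and removes it, e.g.:

import re
# Pass re.I as flags=; the fourth positional argument of re.sub is count.
content = re.sub('<script>.*?</script>', "", content, flags=re.I)

This replaces every <script>...</script> block in content with an empty string. Note that re.sub returns the new string rather than modifying content in place.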
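As a rough sketch of the same idea wrapped in a reusable function: the helper name strip_script_tags and the sample bytes are made up for illustration, and the pattern assumes the injected tag always looks like the one in the question. It works on raw bytes so the binary part of the xls file is left untouched.

```python
import re

# Assumed pattern: a <script ...> opening tag, anything up to the first
# closing </script>, matched case-insensitively and across newlines.
SCRIPT_RE = re.compile(br'<script\b[^>]*>.*?</script>', flags=re.I | re.S)


def strip_script_tags(data):
    """Return data (bytes) with every <script>...</script> block removed."""
    return SCRIPT_RE.sub(b'', data)


# Illustrative input: the script tag from the question followed by a few
# arbitrary bytes standing in for the real spreadsheet contents.
sample = b"<script>parent.parent.location.href = '../../../../a';</script>\xd0\xcf\x11\xe0"
cleaned = strip_script_tags(sample)
```

Compiling the pattern with flags= sidesteps the positional-argument pitfall, because a compiled pattern's sub method takes no flags argument at all.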

Upvotes: 1
