green
green

Reputation: 27

turn long url in short url - replace python

The long urls in myfile.txt must be in turned in short urls. This is in myfile.txt:

26-04-2018 | Publication  2018, 88936 , https://search.publications.com/pgm-2018-88936.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=0&sorttype=1&sortorder=4

19-04-2018 | Publication 2018, 8168 , https://search.publications.com/pgm-2018-8168.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=1&sorttype=1&sortorder=4

26-03-2018 | Publication 2018, 611724 , https://search.publications.com/pgm-2018-611724.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=2&sorttype=1&sortorder=4

01-02-2017 | Publication 2017, 1452026 , https://search.publications.com/pgm-2017-1452026.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=3&sorttype=1&sortorder=4

In python 2.7 there is this code:

import re

with open('myfile.txt', 'r+') as myfile:
    data = myfile.read()
    url = re.findall(r'[^https.+?]', data)
    urlshort = re.findall(r'[^https.+html?]', data)
    for url in data:
        myfile.write(url.replace(url, urlshort, data))
myfile.close()

The output is:

Traceback (most recent call last): File "/pyscripts/data.py", line 9, in myfile.write(url.replace(url, urlshort, data)) TypeError: an integer is required

How to make this work in the file?

Upvotes: 1

Views: 114

Answers (1)

user3483203
user3483203

Reputation: 51185

Use re.sub with (https.*html).*

import re

s = """
26-04-2018 | Publication  2018, 88936 , https://search.publications.com/pgm-2018-88936.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=0&sorttype=1&sortorder=4    
19-04-2018 | Publication 2018, 8168 , https://search.publications.com/pgm-2018-8168.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=1&sorttype=1&sortorder=4  
26-03-2018 | Publication 2018, 611724 , https://search.publications.com/pgm-2018-611724.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=2&sorttype=1&sortorder=4    
01-02-2017 | Publication 2017, 1452026 , https://search.publications.com/pgm-2017-1452026.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=3&sorttype=1&sortorder=4
"""

print(re.sub(r'(https.*html).*', r'\1', s))

Output:

26-04-2018 | Publication  2018, 88936 , https://search.publications.com/pgm-2018-88936.html
19-04-2018 | Publication 2018, 8168 , https://search.publications.com/pgm-2018-8168.html
26-03-2018 | Publication 2018, 611724 , https://search.publications.com/pgm-2018-611724.html
01-02-2017 | Publication 2017, 1452026 , https://search.publications.com/pgm-2017-1452026.html

This way you can just write the entire result of re.sub to your file instead of trying to replace the way you currently doing it.

Upvotes: 1

Related Questions