ganesa75

Reputation: 139

Remove URLs from a text file

I need to remove all the URLs from a text file. I read the file, iterate over it line by line, and write a clean file. However, the code below behaves strangely: it removes the first line of the original file and adds three new lines in total. Most importantly, it doesn't remove the URLs.

import sys
import re

sys.stdout = open('text_clean.txt', 'w')

with open("text.txt",encoding="'Latin-1'") as f:
    rep = re.compile(r"""
                        http[s]?://.*?\s
                        |www.*?\s
                        |(\n)
                        """, re.X)
    non_asc = re.compile(r"[^\x00-\x7F]")
    for line in f:
        non = non_asc.search(line)
        if non:
            continue
        m = rep.search(line)
        if m:
            line = line.replace(m.group(), "")
            if line.strip():
                print(line.strip())      

Upvotes: 1

Views: 4179

Answers (1)

Marko Mackic

Reputation: 2333

You can use re.sub to replace every match with "" across the whole file content in one pass, and it's probably the most efficient way to do it:

import re

# Optional http(s):// scheme, a dotted host name, and an optional path.
url_pattern = re.compile(r'(?:https?://)?[-a-zA-Z0-9.]{2,256}\.[a-z]{2,4}\b(?:/[-a-zA-Z0-9@:%_+.~#?&/=]*)?')

with open("text.txt", encoding="latin-1") as f:
    text = url_pattern.sub("", f.read())

# Drop the lines that are empty once the URLs are gone.
text = '\n'.join(line for line in text.split("\n") if line.strip())

with open('text_clean.txt', 'w') as new_file:
    new_file.write(text)
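If the file is too big to read in one go, the same idea should also work line by line. This is only a sketch: the pattern is equivalent to the one in the code above, and `clean_lines` and `clean_file` are names I made up for illustration. Note that `sub` removes every match on a line, whereas `search` followed by `str.replace` (as in the question) only targets the text of the first match.

```python
import re

# Equivalent to the pattern used in the answer's code, compiled once for reuse.
URL_RE = re.compile(
    r'(?:https?://)?[-a-zA-Z0-9.]{2,256}\.[a-z]{2,4}\b'
    r'(?:/[-a-zA-Z0-9@:%_+.~#?&/=]*)?'
)

def clean_lines(lines):
    """Yield each line with all URL matches removed, skipping blank results."""
    for line in lines:
        cleaned = URL_RE.sub("", line).rstrip("\n")
        if cleaned.strip():
            yield cleaned

def clean_file(src, dst, encoding="latin-1"):
    """Hypothetical helper: stream src to dst without loading the whole file."""
    with open(src, encoding=encoding) as fin, \
         open(dst, "w", encoding=encoding) as fout:
        for cleaned in clean_lines(fin):
            fout.write(cleaned + "\n")
```

Calling `clean_file("text.txt", "text_clean.txt")` would then replace the read-everything version; the file names are the ones from the question.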

Test input I used:

asdas
d
asd
asd
https://www.google.com
http://facebook.com
facebook.com
google.com
dasd.asdasd.asd //this is url too ? 

Output:

asdas
d
asd
asd
 //this is url too ? 
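As the last line shows, the pattern treats any dotted word as a URL. If that is too aggressive for your data, one possible trade-off is to remove only text that starts with an explicit `http://`, `https://` or `www.` prefix. A minimal sketch, assuming URLs never contain whitespace; `strip_explicit_urls` is a hypothetical name, and bare domains such as `facebook.com` are deliberately left in place:

```python
import re

# Only text with an explicit scheme or a leading "www." counts as a URL here.
EXPLICIT_URL_RE = re.compile(r'(?:https?://|www\.)\S+')

def strip_explicit_urls(text):
    """Remove http(s):// and www. URLs; leave other dotted words untouched."""
    return EXPLICIT_URL_RE.sub("", text)

sample = "https://www.google.com\nfacebook.com\ndasd.asdasd.asd //this is url too ?"
print(strip_explicit_urls(sample))
```

With this variant the first line of `sample` is emptied, while `facebook.com` and `dasd.asdasd.asd` survive, so it misses bare domains in exchange for fewer false positives.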

Upvotes: 1
