Reputation: 139
I need to remove all the URLs from a text file. I read the file, iterate over it line by line, and write a clean file. However, the code below behaves strangely: it removes the first line of the original file and adds three new lines in total. Most importantly, it doesn't remove the URLs.
import sys
import re

sys.stdout = open('text_clean.txt', 'w')

with open("text.txt",encoding="'Latin-1'") as f:
    rep = re.compile(r"""
        http[s]?://.*?\s
        |www.*?\s
        |(\n)
        """, re.X)
    non_asc = re.compile(r"[^\x00-\x7F]")
    for line in f:
        non = non_asc.search(line)
        if non:
            continue
        m = rep.search(line)
        if m:
            line = line.replace(m.group(), "")
        if line.strip():
            print(line.strip())
Upvotes: 1
Views: 4179
Reputation: 2333
You can replace every match with an empty string using re.sub, and it's probably the most efficient way to do it:
import re

# Read the whole file and remove anything that looks like a URL.
with open("text.txt", encoding="latin-1") as f:
    text = re.sub(r'(?:(?:http|https):\/\/)?([-a-zA-Z0-9.]{2,256}\.[a-z]{2,4})\b(?:\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?', "", f.read(), flags=re.MULTILINE)

# Drop the lines that are now empty (or whitespace only).
text = '\n'.join([a for a in text.split("\n") if a.strip()])

with open('text_clean.txt', 'w') as new_file:
    new_file.write(text)
Test input I used:
asdas
d
asd
asd
https://www.google.com
http://facebook.com
facebook.com
google.com
dasd.asdasd.asd //this is url too ?
Output:
asdas
d
asd
asd
//this is url too ?
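If you'd rather keep the line-by-line approach from the question, here is a minimal sketch that applies the same pattern to each line. The names `URL_RE`, `strip_urls` and `clean_file` are made up for illustration, and writing the output with the same Latin-1 encoding is my assumption. Keep in mind that the pattern removes anything shaped like `name.tld`, so a filename such as `notes.txt` would be stripped as well.

```python
import re

# Same pattern as above; it also matches bare domains such as "google.com".
URL_RE = re.compile(
    r'(?:(?:http|https):\/\/)?([-a-zA-Z0-9.]{2,256}\.[a-z]{2,4})\b'
    r'(?:\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?'
)


def strip_urls(lines):
    """Yield each line with URL-like text removed, skipping lines left blank."""
    for line in lines:
        cleaned = URL_RE.sub("", line).strip()
        if cleaned:
            yield cleaned


def clean_file(src="text.txt", dst="text_clean.txt", encoding="latin-1"):
    """Stream src to dst one line at a time (hypothetical helper)."""
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for cleaned in strip_urls(fin):
            fout.write(cleaned + "\n")
```

Calling `clean_file()` with the defaults reads `text.txt` and writes `text_clean.txt`. Unlike the `f.read()` version it never holds the whole file in memory, and it uses `sub` rather than `search`, so every URL on a line is removed, not just the first one.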
Upvotes: 1