Reputation: 41
Basically, I want to write a script that sanitizes a URL by replacing its dots with the string "(dot)". For example, if I have http://www.google.com, after running the script I would like it to be http://www(dot)google(dot)com. This is pretty easy to achieve with .replace when my text file consists only of URLs or other strings, but in my case the file also contains IP addresses, and I don't want the dots in the IP addresses to change to "(dot)".
I tried to do this using regex, but my output is " http://ww(dot)oogl(dot)om 192.60.10.10 33.44.55.66 "
This is my code:
from __future__ import print_function
import sys
import re
nargs = len(sys.argv)
if nargs < 2:
    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
fp = open(inputFile)
content = fp.read()
replace = '(dot)'
regex = '[a-z](\.)[a-z]'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))
I guess I need a condition that checks whether the pattern is number.number and, if so, does not replace it.
Upvotes: 2
Views: 1505
Reputation: 745
Judging by your code, you were hoping to replace the first group within your pattern. However, re.sub replaces the entire matching pattern, rather than a group. In your case, this is the single character right before the period, the period itself, and the single character after it.
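To illustrate the difference, here is a small sketch. The sample string is taken from the question, and the variable names are only for illustration. The first call replaces the whole three-character match, while the second captures the neighbouring letters and writes them back with backreferences.

```python
import re

text = "http://www.google.com 192.60.10.10"

# The entire match (letter, dot, letter) is replaced, so both letters are lost.
whole_match = re.sub(r"[a-z](\.)[a-z]", "(dot)", text)

# Capturing the letters and referring back to them keeps them in the result.
with_groups = re.sub(r"([a-z])\.([a-z])", r"\1(dot)\2", text)

print(whole_match)
print(with_groups)
```

The first result reproduces the mangled output shown in the question. The second keeps the letters, but still only handles dots that sit directly between two letters.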
Even if sub worked as you hoped, your regex would miss edge cases where numbers are part of URLs, such as www.2048game.com.
Defining what an IP looks like is far easier. It's always a set of four numbers with one, two or three digits each, separated by dots. (In the case of IPv4, anyway. But IPv6 does not use periods, so it doesn't matter here.)
Assuming you have only URLs and IPs in your text file, simply filter out all IPs and then replace the periods in the remaining URLs:
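import re
# An IPv4 address: four groups of one to three digits, separated by dots.
is_ip = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
urls = content.split(" ")
for i, url in enumerate(urls):
    # Leave IP addresses alone; replace the dots in everything else.
    if not is_ip.match(url):
        urls[i] = url.replace('.', '(dot)')
content = ' '.join(urls)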
Of course, if you have regular text in content, this will also replace all regular periods, not just those in URLs. In that case, you would first need more intricate URL detection. See "In search of the perfect URL validation regex".
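If the file separates entries with newlines or tabs rather than single spaces, the same filter-then-replace idea can be applied to whitespace-delimited tokens. The following is a minimal sketch under that assumption. The names IPV4 and sanitize_tokens are made up for illustration, and the pattern does not check that each number is at most 255.

```python
import re

# Four groups of one to three digits separated by dots (no range check).
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def sanitize_tokens(content):
    # The capturing group makes re.split keep the whitespace runs,
    # so the original layout can be restored when joining.
    parts = re.split(r"(\s+)", content)
    return "".join(
        part if part.isspace() or IPV4.fullmatch(part)
        else part.replace(".", "(dot)")
        for part in parts
    )


print(sanitize_tokens("http://www.google.com 192.60.10.10\nwww.2048game.com\n"))
```

Like the loop above, this still rewrites every period in ordinary prose, so it only suits files that contain nothing but URLs and IP addresses.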
Upvotes: 1
Reputation: 180441
You can use lookahead and lookbehind assertions:
import re
s = "http://www.google.com 127.0.0.1"
print(re.sub(r"(?<=[a-z])\.(?=[a-z])", "(dot)", s))
http://www(dot)google(dot)com 127.0.0.1
To handle both letters and digits, this should hopefully do the trick; the lookahead (?=.*[a-z]) makes sure at least one letter follows the dot on the same line:
s = "http://www.googl1.2com 127.0.0.1"
print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", s, flags=re.I))
http://www(dot)googl1(dot)2com 127.0.0.1
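A note on passing flags: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag given as the fourth positional argument is interpreted as count. The sketch below uses a made-up input string to show the difference. Since re.I has the integer value 2, passing it in the count position means "at most two replacements, case-sensitive".

```python
import re

pattern = r"(?<=[a-z])\.(?=[a-z])"
s = "WWW.GOOGLE.COM www.example.org"

# Equivalent to re.sub(pattern, "(dot)", s, re.I): the flag lands in count,
# so matching stays case-sensitive and stops after two replacements.
as_count = re.sub(pattern, "(dot)", s, count=int(re.I))

# Passing the flag by keyword enables case-insensitive matching.
as_flags = re.sub(pattern, "(dot)", s, flags=re.I)

print(as_count)
print(as_flags)
```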
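For your file, the same pattern works across lines, because . does not match a newline unless re.DOTALL is set, so the lookahead only looks at the rest of the current line (re.M is harmless here, but it only affects ^ and $):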
In [1]: cat test.txt
google8.com
google9.com
192.60.10.10
33.44.55.66
google10.com
192.168.1.1
google11.com
In [2]: with open("test.txt") as f:
   ...:     import re
   ...:     print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", f.read(), flags=re.I | re.M))
   ...:
google8(dot)com
google9(dot)com
192.60.10.10
33.44.55.66
google10(dot)com
192.168.1.1
google11(dot)com
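One caveat about the lookahead version: (?=.*[a-z]) only requires a letter somewhere later on the same line, so an IP address that appears before a URL on the same line is rewritten too. A minimal sketch of one way around this, assuming IPv4 addresses appear as standalone dotted quads, is to match either a whole address or a single dot and let a replacement function decide. The names TOKEN, replace_dot and sanitize are made up for illustration.

```python
import re

TOKEN = re.compile(
    r"(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?![\w.])"  # a standalone dotted quad
    r"|(?<=\w)\.(?=\w)"                            # or a dot between word characters
)


def replace_dot(match):
    # A lone dot becomes "(dot)"; a matched IP address is returned unchanged.
    return "(dot)" if match.group(0) == "." else match.group(0)


def sanitize(text):
    return TOKEN.sub(replace_dot, text)


print(sanitize("127.0.0.1 http://www.google.com"))
```

Because the address alternative is tried first at each position, its dots are consumed as part of the address and never reach the single-dot branch.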
If the files were large and memory was an issue you could also do it line by line, either storing all the lines in a list or using each line as you go:
import re
with open("test.txt") as f:
    r = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)
    # Apply the compiled pattern to each line in turn.
    lines = [r.sub("(dot)", line) for line in f]
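For the "using each line as you go" variant, a generator avoids building the list at all. This is a minimal sketch: sanitize_lines is a hypothetical helper, and io.StringIO stands in for the open file so the example is self-contained.

```python
import io
import re

DOT = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)


def sanitize_lines(lines):
    # Yield one sanitized line at a time instead of collecting them in a list.
    for line in lines:
        yield DOT.sub("(dot)", line)


sample = io.StringIO("google8.com\n192.60.10.10\ngoogle10.com\n")
result = "".join(sanitize_lines(sample))
print(result, end="")
```

With a real file, the loop would typically write each yielded line straight to an output file, so only one line is held in memory at a time.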
Upvotes: 3
Reputation: 2327
You have to capture the [a-z] content before and after the dot so that you can put it back into the replaced string. Here is how I solved it:
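from __future__ import print_function
import sys
import re
nargs = len(sys.argv)
if nargs < 2:
    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
fp = open(inputFile)
content = fp.read()
# \1 and \3 put back the letters captured on either side of the dot.
replace = '\\1(dot)\\3'
# Capture one letter on each side (a greedy .* would leave earlier dots untouched).
regex = r'([a-z])(\.)([a-z])'
# Flags must be passed by keyword; the fourth positional argument is count.
print(re.sub(regex, replace, content, flags=re.M | re.I | re.DOTALL))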
Upvotes: 0
Reputation: 350
import re
content = "I tried to do this using regex, but my output is http://www.googl.com 192.60.10.10 33.44.55.66\nhttp://ya.ru\n..."
reg = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
all_urls = re.findall(reg, content, re.M| re.I| re.DOTALL)
repl_urls = [u.replace('.', '(dot)') for u in all_urls]
for u, r in zip(all_urls, repl_urls):
    content = content.replace(u, r)
print(content)
Upvotes: 0