Reputation: 6806
I want to remove the dots in acronyms, but not the dots in domain names, in a Python string. For example, I want the string
'a.b.c. [email protected] http://www.test.com'
to become
'abc [email protected] http://www.test.com'
The closest regex I made so far is
re.sub(r'(?:\s|\A).{1}\.', lambda s: s.group()[0:2], s)
which results in
'ab.c. [email protected] http://www.test.com'
It seems that for the above regex to work, I need to change the regex to
(?:\s|\A|\G).{1}\.
but Python's re module has no \G anchor (end of the previous match).
EDIT: As I mentioned in my comment, the strings have no specific formatting. They contain informal human conversations, so they may contain zero, one, or several acronyms or domain names. A few errors are fine by me if that saves me from coding a "real" parser.
Upvotes: 1
Views: 7405
Reputation: 12548
A non-regex way:
>>> S = 'a.b.c. [email protected] http://www.test.com'
>>> ' '.join(w if '@' in w or ':' in w else w.replace('.', '') for w in S.split())
'abc [email protected] http://www.test.com'
(Requires spaces to split on, though - so if you had something like commas with no spaces it could miss some.)
Upvotes: 1
Reputation: 6806
The following worked for me (with thanks to Bart for his answer):
re.sub(r'\.(?!(\S[^. ])|\d)', '', s)
This will not remove a dot if it is the first character in a word or acronym.
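For anyone who wants to try the pattern on a sample, here is a minimal sketch. The helper name strip_acronym_dots and the address john.doe@example.com are placeholders made up for illustration; they are not from the original data.

```python
import re

# Remove a dot unless it is followed either by a non-space character and
# then a character that is neither a dot nor a space, or by a digit.
PATTERN = re.compile(r'\.(?!(\S[^. ])|\d)')

def strip_acronym_dots(text):
    # Hypothetical helper name, used only for this illustration.
    return PATTERN.sub('', text)

sample = 'a.b.c. john.doe@example.com http://www.test.com'  # made-up address
print(strip_acronym_dots(sample))  # abc john.doe@example.com http://www.test.com
```

The \d alternative is what keeps decimals such as 3.5 intact.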
Upvotes: 1
Reputation: 386285
I suggest you split the string at '@' (or whatever character makes sense), do the substitution on the first part, then put the string back together. I think that will show the intent of the code better than a complex regexp. Something like this, perhaps:
string = 'a.b.c. [email protected] http://www.test.com'
left, rest = string.split("@", 1)
left = left.replace(".", "")
result = "%s@%s" % (left, rest)
Upvotes: 2
Reputation: 39878
Not as elegant as a simple re.sub(), but try this:
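import re
s = 'a.b.c. [email protected] http://www.test.com'
# Group 2 captures the first run of two or more "letter + dot" pairs.
m = re.search(r'(.*?)(([a-zA-Z]\.){2,})(.*)', s)
if m:
    # Drop the dots from the acronym and stitch the string back together.
    replacement = ''.join(m.group(2).split('.'))
    s = m.group(1) + replacement + m.group(4)
print(s)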
It assumes that there's no more than one acronym per string, but you could always run it repeatedly.
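If several acronyms per string are likely, one possible variation is to let re.sub() visit every match with a callback instead of looping. This is only a sketch: the name collapse_acronyms is made up, and the \b anchor is my own addition to keep the match from starting in the middle of a word.

```python
import re

# Two or more "letter + dot" pairs starting at a word boundary,
# e.g. "a.b.c." or "e.g."
ACRONYM = re.compile(r'\b(?:[a-zA-Z]\.){2,}')

def collapse_acronyms(text):
    # Hypothetical helper: strip the dots from every acronym-like run.
    return ACRONYM.sub(lambda m: m.group().replace('.', ''), text)

print(collapse_acronyms('a.b.c. and e.g. http://www.test.com'))
# abc and eg http://www.test.com
```

It still has corner cases: a host name made of single letters, such as a.b.com, would be treated as an acronym.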
Upvotes: 0
Reputation: 7894
If your data is always formatted like this, why not split it into three parts on the spaces?
Then it's trivial to remove the periods from the first element and use join to merge the parts back together.
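A minimal sketch of that idea, assuming every string really has exactly three space-separated fields (acronym, e-mail, URL). The function name and the sample address are invented for illustration.

```python
def strip_first_field(text):
    # Assumes exactly three space-separated fields; unpacking raises
    # ValueError for any other number of fields.
    acronym, email, url = text.split(' ')
    return ' '.join([acronym.replace('.', ''), email, url])

sample = 'a.b.c. john.doe@example.com http://www.test.com'  # made-up address
print(strip_first_field(sample))  # abc john.doe@example.com http://www.test.com
```

The ValueError on malformed input is deliberate here: it flags strings that do not follow the assumed layout instead of silently mangling them.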
Upvotes: 5
Reputation: 170268
You could simply remove DOTS that don't have two [a-z] letters (or more) ahead of them:
\.(?![a-zA-Z]{2})
But that will of course also remove the first DOT from the following address:
You could fix that by doing:
\.(?![a-zA-Z]{2}|[^\s@]*+@)
but I'm sure there will be many more such corner cases.
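Here is a rough sketch of both patterns in Python. One caveat: Python's re module only accepts possessive quantifiers such as *+ from version 3.11 on, so the sketch uses a plain * instead. Because the character class already excludes @, backtracking cannot change which dots match. The sample address john.d@example.com is made up for the demonstration.

```python
import re

# First pattern: drop a dot unless two ASCII letters follow it.
SIMPLE = re.compile(r'\.(?![a-zA-Z]{2})')
# Second pattern, with the possessive "*+" written as a plain "*":
# also keep a dot when an "@" follows before any whitespace.
WITH_AT = re.compile(r'\.(?![a-zA-Z]{2}|[^\s@]*@)')

sample = 'a.b.c. john.d@example.com http://www.test.com'  # made-up address

print(SIMPLE.sub('', sample))   # abc johnd@example.com http://www.test.com
print(WITH_AT.sub('', sample))  # abc john.d@example.com http://www.test.com
```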
Upvotes: 2