Reputation: 6806

What's the regex for removing dots in acronyms but not in domain names?

I want to remove the dots in acronyms, but not in domain names, in a Python string. For example, I want the string

'a.b.c. [email protected] http://www.test.com'

to become

'abc [email protected] http://www.test.com'

The closest regex I made so far is

re.sub('(?:\s|\A).{1}\.',lambda s: s.group()[0:2], s)

which results in

'ab.c. [email protected] http://www.test.com'

It seems that for the above regex to work, I need to change the regex to

(?:\s|\A|\G).{1}\.

but Python's re module does not support the \G anchor (the position where the previous match ended).

EDIT: As I mentioned in my comment, the strings have no specific formatting. They contain informal human conversations, so each may contain zero, one, or several acronyms or domain names. A few errors are fine by me if that saves me from coding a "real" parser.

Upvotes: 1

Answers (6)

Anon

Reputation: 12548

A non-regex way:

>>> S = 'a.b.c. [email protected] http://www.test.com'
>>> ' '.join(w if '@' in w or ':' in w else w.replace('.', '') for w in S.split())
'abc [email protected] http://www.test.com'

(Requires spaces to split on, though - so if you had something like commas with no spaces it could miss some.)
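A minimal sketch of the same idea that also copes with commas and keeps the original whitespace intact. It lets re.sub() walk over the tokens instead of str.split(). The function name strip_acronym_dots and the email address are made up for illustration, and the rule "a token containing '@' or ':' is an address" is the same rough assumption as above.

```python
import re

def strip_acronym_dots(text):
    """Remove dots from tokens that do not look like an email or a URL."""
    def fix(match):
        token = match.group()
        # Assumption: '@' marks an email address and ':' marks a URL scheme.
        if '@' in token or ':' in token:
            return token
        return token.replace('.', '')
    # A token is a run of characters other than whitespace and commas,
    # so separators are left exactly as they were.
    return re.sub(r'[^\s,]+', fix, text)

print(strip_acronym_dots('a.b.c.,user@example.com  http://www.test.com'))
```

A bare domain such as www.test.com with no scheme would still lose its dots, since nothing in the token marks it as an address.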

Upvotes: 1

Christian Alis

Reputation: 6806

The following worked for me (with thanks to Bart for his answer):

re.sub(r'\.(?!(\S[^. ])|\d)', '', s)

This will not remove a dot if it is the first character in a word or acronym.
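To see what the lookahead keeps and what it drops, here is a small sketch that runs the pattern over a few inputs. The sample strings and the email address are made up for illustration. Note that the last sample loses its sentence-ending period, which may or may not count as one of the acceptable errors.

```python
import re

# Remove a dot unless it is followed by a non-space character and then a
# character that is neither a dot nor a space (as in ".com"), or by a digit.
DOT = re.compile(r'\.(?!(\S[^. ])|\d)')

samples = [
    'a.b.c. user@example.com http://www.test.com',  # acronym plus addresses
    'version 3.14 of .net',                         # digits and a leading dot
    'See you. Bye',                                 # ordinary sentence period
]
for s in samples:
    print(repr(s), '->', repr(DOT.sub('', s)))
```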

Upvotes: 1

Bryan Oakley

Reputation: 386285

I suggest you split the string at '@' (or whatever character makes sense), do the substitution on the first part, then put the string back together. I think that will show the intent of the code better than a complex regexp. Something like this, perhaps:

string = 'a.b.c. [email protected] http://www.test.com'
left, rest = string.split("@", 1)
left = left.replace(".", "")
result = "%s@%s" % (left, rest)

Upvotes: 2

Head Geek

Reputation: 39878

Not as elegant as a simple re.sub(), but try this:

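import re

s = 'a.b.c. [email protected] http://www.test.com'
# group(2) is a run of two or more "letter + dot" pairs, i.e. the acronym
m = re.search(r'(.*?)(([a-zA-Z]\.){2,})(.*)', s)

if m:
    # Drop the dots from the acronym and stitch the string back together
    replacement = ''.join(m.group(2).split('.'))
    s = m.group(1) + replacement + m.group(4)

print(s)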

It assumes that there's no more than one acronym per string, but you could always run it repeatedly.
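One way to avoid running the search repeatedly is to give re.sub() a replacement function, which is called once for every match. The sketch below is a variant under stated assumptions, not the code above. The lookbehind and lookahead are additions meant to keep the pattern from firing inside domain names. The function name and the email address are made up for illustration.

```python
import re

# Two or more "letter + dot" pairs that do not start right after a word
# character, '.', '@' or '/', and are not followed by a word character.
# Both lookarounds are assumptions added to skip things like "a.b.com".
ACRONYM = re.compile(r'(?<![\w.@/])(?:[A-Za-z]\.){2,}(?!\w)')

def remove_acronym_dots(text):
    """Strip the dots from every acronym-like run in text."""
    return ACRONYM.sub(lambda m: m.group().replace('.', ''), text)

print(remove_acronym_dots(
    'a.b.c. user@example.com http://www.test.com u.s.a. ok'))
```

The trade-off is that an acronym written without its trailing dot, such as 'U.S.A', is left untouched because the final letter trips the lookahead.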

Upvotes: 0