Christian Alis

Reputation: 6806

What's the regex for removing dots in acronyms but not in domain names?

I want to remove dots in acronyms but not in domain names in a Python string. For example, I want the string

'a.b.c. [email protected] http://www.test.com'

to become

'abc [email protected] http://www.test.com'

The closest regex I made so far is

re.sub('(?:\s|\A).{1}\.',lambda s: s.group()[0:2], s)

which results in

'ab.c. [email protected] http://www.test.com'

It seems that for the above regex to work, I need to change the regex to

(?:\s|\A|\G).{1}\.

but Python's re module has no \G anchor (the marker for the end of the previous match).

EDIT: As I mentioned in my comment, the strings have no specific formatting. They contain informal human conversations, so they may contain zero, one or several acronyms or domain names. A few errors are fine by me if that saves me from coding a "real" parser.

Upvotes: 1

Views: 7405

Answers (6)

Anon

Reputation: 12548

A non-regex way:

>>> S = 'a.b.c. [email protected] http://www.test.com'
>>> ' '.join(w if '@' in w or ':' in w else w.replace('.', '') for w in S.split())
'abc [email protected] http://www.test.com'

(Requires spaces to split on, though - so if you had something like commas with no spaces it could miss some.)
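To cope with the comma case, one option is to split on runs of whitespace or commas while keeping the separators. This is only a sketch: the function name `strip_acronym_dots` and the `user@example.com` address are made up for illustration, and a bare domain with no '@' or ':' in it (say `www.test.com`) would still lose its dots.

```python
import re

def strip_acronym_dots(text):
    # Split on runs of whitespace or commas; the capture group keeps
    # the separators in the result so the string can be rebuilt exactly.
    parts = re.split(r'([\s,]+)', text)
    cleaned = []
    for part in parts:
        if '@' in part or ':' in part:
            # Looks like an email address or a URL: leave it alone
            cleaned.append(part)
        else:
            cleaned.append(part.replace('.', ''))
    return ''.join(cleaned)

print(strip_acronym_dots('a.b.c.,user@example.com http://www.test.com'))
# abc,user@example.com http://www.test.com
```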

Upvotes: 1

Christian Alis

Reputation: 6806

The following worked for me (with thanks to Bart for his answer):

re.sub(r'\.(?!(\S[^. ])|\d)', '', s)

This will not remove a dot if it is the first character in a word or acronym.
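For illustration, here is how that pattern behaves on a couple of inputs. The sample strings are made up (the address is a placeholder) and `remove_acronym_dots` is just a name chosen for this sketch, so treat it as a quick check rather than proof that every case is covered.

```python
import re

# Remove a dot unless it is followed by a non-space character plus a
# character that is neither a dot nor a space, or by a digit.
PATTERN = re.compile(r'\.(?!(\S[^. ])|\d)')

def remove_acronym_dots(s):
    return PATTERN.sub('', s)

print(remove_acronym_dots('a.b.c. user@example.com http://www.test.com'))
# abc user@example.com http://www.test.com

print(remove_acronym_dots('see .net docs'))
# see .net docs   (the leading dot is kept)
```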

Upvotes: 1

Bryan Oakley

Reputation: 386285

I suggest you split the string at '@' (or whatever character makes sense), do the substitution on the first part, then put the string back together. I think that will show the intent of the code better than a complex regexp. Something like this, perhaps:

string='a.b.c. [email protected] http://www.test.com'
left, rest = string.split("@",1)
left = left.replace(".","")
result="%s@%s" % (left, rest)

Upvotes: 2

Head Geek

Reputation: 39878

Not as elegant as a simple re.sub(), but try this:

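import re

s = 'a.b.c. [email protected] http://www.test.com'
# Group 2 captures a run of two or more single letters, each followed by a dot
m = re.search(r'(.*?)(([a-zA-Z]\.){2,})(.*)', s)

if m:
    # Drop the dots from the acronym and stitch the string back together
    replacement = ''.join(m.group(2).split('.'))
    s = m.group(1) + replacement + m.group(4)

print(s)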

It assumes that there's no more than one acronym per string, but you could always run it repeatedly.
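Instead of running it repeatedly, the same acronym pattern could be passed to re.sub() with a callback, which handles every match in one pass. This is a sketch under my own assumptions: the `\b` anchor and the name `collapse_acronyms` are additions, not part of the code above, and something like `a.b.com` would still be mangled.

```python
import re

# Two or more single letters, each followed by a dot, starting at a word boundary
ACRONYM = re.compile(r'\b(?:[a-zA-Z]\.){2,}')

def collapse_acronyms(s):
    # The callback receives each match and returns it with the dots removed
    return ACRONYM.sub(lambda m: m.group().replace('.', ''), s)

print(collapse_acronyms('a.b.c. and u.s.a. visit http://www.test.com'))
# abc and usa visit http://www.test.com
```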

Upvotes: 0

chollida

Reputation: 7894

If your data is always formatted like this, why not split it into three parts on the spaces?

Then it's pretty trivial to remove the periods from the first element and use join to remerge the parts.
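A minimal sketch of that idea, assuming every line really has exactly three space-separated fields (acronym, email, URL). The function name `clean_fixed_format` and the placeholder address are made up for the example; the unpacking raises ValueError if the assumption does not hold.

```python
def clean_fixed_format(line):
    # Assumes exactly three whitespace-separated fields: acronym, email, URL
    acronym, email, url = line.split()
    # Only the first field has its dots removed
    return ' '.join([acronym.replace('.', ''), email, url])

print(clean_fixed_format('a.b.c. user@example.com http://www.test.com'))
# abc user@example.com http://www.test.com
```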

Upvotes: 5

Bart Kiers

Reputation: 170268

You could simply remove DOTS that don't have two [a-z] letters (or more) ahead of them:

\.(?![a-zA-Z]{2})

But that will of course also remove the first DOT from the following address:

[email protected]

You could fix that by doing:

\.(?![a-zA-Z]{2}|[^\s@]*+@)

but I'm sure there will be many more such corner cases.
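In Python this could look roughly like the sketch below. Note that the possessive quantifier `*+` is only accepted by the re module from Python 3.11 onwards, so the sketch uses a plain `*` instead, which matches the same strings here. The function name `remove_dots` and the sample address are placeholders chosen for the example.

```python
import re

# Remove a dot unless two letters follow it, or an '@' appears
# later in the same whitespace-free token (i.e. inside an email's local part).
DOT = re.compile(r'\.(?![a-zA-Z]{2}|[^\s@]*@)')

def remove_dots(s):
    return DOT.sub('', s)

print(remove_dots('a.b.c. j.d@example.com http://www.test.com'))
# abc j.d@example.com http://www.test.com
```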

Upvotes: 2
