Reputation: 6781
So, for input:
accessibility,random good bye
I want output:
a11y,r4m g2d bye
So, basically, I have to abbreviate all words of length greater than or equal to 4 in the following format: first_letter + length_of_all_letters_in_between + last_letter
I tried this:
re.sub(r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])", r"\1" + str(len(r"\2")) + r"\3", s)
But it does not work. In JS, I would easily do:
str.replace(/([A-Za-z])([A-Za-z]{2,})([A-Za-z])/g, function(m, $1, $2, $3){
    return $1 + $2.length + $3;
});
How do I do the same in Python?
EDIT: I cannot afford to lose any punctuation present in original string.
Upvotes: 8
Views: 1576
Reputation: 107357
As an alternative, more precise approach, you can pass a separate function to re.sub and use the simple regex r"(\b[a-zA-Z]+\b)".
>>> def replacer(x):
...     g = x.group(0)
...     if len(g) > 3:
...         return '{}{}{}'.format(g[0], len(g) - 2, g[-1])
...     else:
...         return g
...
>>> re.sub(r"(\b[a-zA-Z]+\b)", replacer, s)
'a11y,r4m g2d bye'
Also, as a more general, Pythonic way to get the replaced words in a list, you can use a list comprehension with re.finditer:
>>> from operator import sub
>>> rep=['{}{}{}'.format(i.group(0)[0],abs(sub(*i.span()))-2,i.group(0)[-1]) if len(i.group(0))>3 else i.group(0) for i in re.finditer(r'(\w+)',s)]
>>> rep
['a11y', 'r4m', 'g2d', 'bye']
re.finditer returns an iterator that yields all the match objects. You can then iterate over it and get the start and end of each match object with its span() method.
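To make the span() arithmetic above concrete, here is a minimal sketch that lists each match together with its (start, end) pair. The variable names are only illustrative; the sample string is the one from the question.

```python
import re

s = "accessibility,random good bye"

# Each match object exposes its text via group() and its
# (start, end) offsets in the original string via span().
matches = [(m.group(), m.span()) for m in re.finditer(r"\w+", s)]
print(matches)
# [('accessibility', (0, 13)), ('random', (14, 20)), ('good', (21, 25)), ('bye', (26, 29))]

# end - start is the word length, which is what abs(sub(*span)) computes.
lengths = [end - start for _, (start, end) in matches]
print(lengths)  # [13, 6, 4, 3]
```

Since end - start equals len(m.group()), either form can be used to compute the number placed in the middle of the abbreviation.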
Upvotes: 1
Reputation: 174874
Keep it simple...
>>> s = "accessibility,random good bye"
>>> re.sub(r'\B[A-Za-z]{2,}\B', lambda x: str(len(x.group())), s)
'a11y,r4m g2d bye'
\B, which matches between two word characters or between two non-word characters, helps to match all the characters except the first and last.
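As a quick illustration of what that pattern actually captures, the sketch below uses re.findall with the same regex on the question's sample string. The name inner is just an illustrative choice.

```python
import re

s = "accessibility,random good bye"

# \B on both sides forces the match to start after the first letter
# and stop before the last letter of each word.
inner = re.findall(r"\B[A-Za-z]{2,}\B", s)
print(inner)  # ['ccessibilit', 'ando', 'oo']
```

"bye" produces no match because only one letter sits between its first and last characters, so the {2,} minimum is not met and the word is left unchanged.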
Upvotes: 1
Reputation: 98118
Using regex and comprehension:
import re
s = "accessibility,random good bye"
print("".join(w[0] + str(len(w) - 2) + w[-1] if len(w) > 3 else w for w in re.split(r"(\W)", s)))
Gives:
a11y,r4m g2d bye
Upvotes: 0
Reputation: 180540
tmp, out = "", ""
for ch in s:
    if ch.isspace() or ch in {",", "."}:
        out += "{}{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1], ch) if len(tmp) > 3 else tmp + ch
        tmp = ""
    else:
        tmp += ch
out += "{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1]) if len(tmp) > 3 else tmp
print(out)
a11y,r4m g2d bye
If you only want alpha characters use str.isalpha:
tmp, out = "", ""
for ch in s:
    if not ch.isalpha():
        out += "{}{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1], ch) if len(tmp) > 3 else tmp + ch
        tmp = ""
    else:
        tmp += ch
out += "{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1]) if len(tmp) > 3 else tmp
print(out)
a11y,r4m g2d bye
The logic is the same for both; only the check differs. If not ch.isalpha() is True, we have found a non-alpha character, so we need to process the tmp string and add it to the out string. If len(tmp) is not greater than 3, as per the requirement, we just add the tmp string plus the current character to out.
We need a final out += "{}{}{}".format(...) outside the loop to handle a string that does not end in a comma, space, etc. If the string does end in a non-alpha character, tmp is empty at that point, so we add an empty string and it makes no difference to the output.
It will preserve punctuation and spaces:
s = "accessibility,random good bye !! foobar?"
def func(s):
    tmp, out = "", ""
    for ch in s:
        if not ch.isalpha():
            out += "{}{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1], ch) if len(tmp) > 3 else tmp + ch
            tmp = ""
        else:
            tmp += ch
    # Flush the last word, then return the whole result rather than only the tail.
    out += "{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1]) if len(tmp) > 3 else tmp
    return out
print(func(s))
a11y,r4m g2d bye !! f4r?
Upvotes: 2
Reputation: 104852
The issue you're running into is that len(r'\2') is always 2, not the length of the second capturing group in your regular expression. You can use a lambda expression to create a function that works just like the code you would use in JavaScript:
re.sub(r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])",
       lambda m: m.group(1) + str(len(m.group(2))) + m.group(3),
       s)
The m argument to the lambda is a match object, and the calls to its group method are equivalent to the backreferences you were using before.
It might be easier to just use a simple word matching pattern with no capturing groups (group() can still be called with no argument to get the whole matched text):
re.sub(r'\w{4,}', lambda m: m.group()[0] + str(len(m.group())-2) + m.group()[-1], s)
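One thing to keep in mind with the \w version: \w also matches digits and underscores, so it can treat more than letters as part of a "word". The sketch below compares the two patterns; the helper abbreviate and the sample text are hypothetical, made up only for this comparison.

```python
import re

def abbreviate(text, pattern):
    """Replace every match of `pattern` with first char + inner length + last char."""
    def repl(m):
        word = m.group()
        return word[0] + str(len(word) - 2) + word[-1]
    return re.sub(pattern, repl, text)

sample = "route_66, hello!"  # hypothetical input, not from the question

with_w = abbreviate(sample, r"\w{4,}")              # digits and "_" count as word chars
with_letters = abbreviate(sample, r"[A-Za-z]{4,}")  # letters only

print(with_w)        # r66, h3o!
print(with_letters)  # r3e_66, h3o!
```

Both variants leave punctuation and spacing untouched, because re.sub only rewrites the matched spans.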
Upvotes: 3
Reputation: 1473
What you are doing in JavaScript is certainly right: you are passing an anonymous function. What you do in Python is pass a constant expression (r"\12\3", since len(r"\2") is evaluated before the function call); it is not a function that can be evaluated for each match!
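A small sketch of that evaluation order, using the pattern from the question. As far as I can tell, CPython reads \12 in a replacement template as a reference to group 12, which this pattern does not have, so the call is expected to fail with re.error rather than produce a wrong string; treat that part as an observation about the interpreter, not a documented guarantee of the exact message.

```python
import re

pattern = r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])"

# The replacement is built once, before re.sub ever sees a match.
replacement = r"\1" + str(len(r"\2")) + r"\3"
print(len(r"\2"))   # 2 (a backslash plus the digit 2)
print(replacement)  # \12\3

try:
    re.sub(pattern, replacement, "accessibility")
    outcome = "no error"
except re.error as exc:
    # The template refers to a group the pattern does not define.
    outcome = "re.error: {}".format(exc)
print(outcome)
```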
While anonymous functions in Python aren't quite as useful as they are in JS, they do the job here:
>>> import re
>>> re.sub(r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])", lambda m: "{}{}{}".format(m.group(1), len(m.group(2)), m.group(3)), "accessability, random good bye")
'a11y, r4m g2d bye'
What happens here is that the lambda is called for each substitution, taking a match object. I then retrieve the needed information and build a substitution string from that.
Upvotes: 8
Reputation: 336
Have a look at the following code
sentence = "accessibility,random good bye"
sentence = sentence.replace(',', " ")
sentence_list = sentence.split(" ")
for item in sentence_list:
    if len(item) >= 4:
        print(item[0] + str(len(item[1:len(item)-1])) + item[len(item)-1])
The only thing you need to take care of is commas and other punctuation characters.
Upvotes: -1