Reputation: 313
I'm writing a tokenizer and I want to split strings like "word-bound-with-hyphen" into "word xxsep bound xxsep with xxsep hyphen".
I tried this:
import re
s = "words-bound-with-hyphen"
reg_m = re.compile(r"[\w\d]+-[\w\d]+")
reg = re.compile(r"([\w\d]+)-([\w\d]+)")
while reg_m.match(s):
    s = reg.sub(r"\1 xxsep \2", s)
print(s) #prints "words xxsep bound-with xxsep hyphen"
But this leaves every other hyphen unreplaced.
Upvotes: 2
Views: 167
Reputation: 104
import re
s = "words-bound-with-hyphen"
re.sub('-',' xxsep ',s)
Or, without using regular expressions:
" xxsep ".join(s.split('-'))
Here the string is split into a list using - as the delimiter, and the pieces are then joined with " xxsep ".
Upvotes: 2
Reputation: 5185
If you don't want to replace every hyphen, but only those that are preceded and followed by certain characters, then use regex lookbehinds and lookaheads.
import re
s = "words-bound-with-hyphen"
re.sub(r'(?<=[\w\d])-(?=[\w\d])', ' xxsep ', s)
# result: 'words xxsep bound xxsep with xxsep hyphen'
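To make the difference from a plain replace visible, here is a small sketch on an input that also contains hyphens not bound between two word characters. The helper name split_inner_hyphens and the sample string are made up for illustration. Since \w already covers digits in Python, the class is shortened to \w here.

```python
import re

SEP = " xxsep "

# The lookbehind and lookahead assert a word character on each side of the
# hyphen without consuming it, so neighbouring hyphens can still match.
INNER_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")

def split_inner_hyphens(text):
    """Replace only hyphens that sit directly between two word characters."""
    return INNER_HYPHEN.sub(SEP, text)

sample = "-well-known fact - see a-b-c-"

# A plain replace touches every hyphen, including the leading, trailing and
# free-standing ones.
print(sample.replace("-", SEP))

# The lookaround version leaves those hyphens alone.
print(split_inner_hyphens(sample))
```

Whether free-standing hyphens should also become separators depends on the tokenizer. If they should, the plain replace is enough.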
Upvotes: 1
Reputation: 500475
You could just replace the hyphens with a regex:
In [4]: re.sub("-", " xxsep ", "word-bound-with-hyphen")
Out[4]: 'word xxsep bound xxsep with xxsep hyphen'
or with string substitution:
In [7]: "word-bound-with-hyphen".replace("-", " xxsep ")
Out[7]: 'word xxsep bound xxsep with xxsep hyphen'
The reason your current approach doesn't work is that re.sub() only replaces non-overlapping matches, whereas word-bound overlaps with bound-with, which in turn overlaps with with-hyphen.
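The non-overlapping behaviour can be seen directly with re.findall. The sketch below also shows two possible ways to keep the question's word-hyphen-word idea. The function names are made up for illustration. Note too that re.match only matches at the start of the string, so the loop in the question stops after a single pass. Using re.search instead lets it run until no bound hyphen is left.

```python
import re

s = "word-bound-with-hyphen"

# Matches are found left to right and never overlap: once "word-bound" has
# been consumed, scanning resumes at "-with-hyphen", so "bound-with" is
# never seen.
pairs = re.findall(r"\w+-\w+", s)
print(pairs)  # ['word-bound', 'with-hyphen']

def sub_with_lookahead(text):
    """Consume only the left-hand word; the right-hand one is just asserted."""
    return re.sub(r"(\w+)-(?=\w)", r"\1 xxsep ", text)

def sub_until_stable(text):
    """Repeat the original substitution until no bound hyphen remains."""
    pattern = re.compile(r"(\w+)-(\w+)")
    # search() scans the whole string, unlike match(), which only tests
    # the beginning.
    while pattern.search(text):
        text = pattern.sub(r"\1 xxsep \2", text)
    return text

print(sub_with_lookahead(s))
print(sub_until_stable(s))
```

Both helpers give the same result as the plain replace on this input. The lookahead version needs only one pass over the string.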
Upvotes: 2