baal_imago

Reputation: 313

Python regex loop skipping every third item

I'm writing a tokenizer and I want to turn hyphenated strings like "word-bound-with-hyphen" into "word xxsep bound xxsep with xxsep hyphen".

I tried this:

import re

s = "words-bound-with-hyphen"
reg_m = re.compile(r"[\w\d]+-[\w\d]+")
reg = re.compile(r"([\w\d]+)-([\w\d]+)")
while reg_m.match(s):
    s = reg.sub(r"\1 xxsep \2", s)
print(s) #prints "words xxsep bound-with xxsep hyphen"

But this leaves some hyphen-bound words unseparated (here, "bound-with").

Upvotes: 2

Views: 167

Answers (3)

Rohit Dwivedi

Reputation: 104

import re
s = "words-bound-with-hyphen"
re.sub('-',' xxsep ',s)

Or, without using regular expressions:

" xxsep ".join(s.split('-'))

Here, the string is split into a list using "-" as the delimiter, and the list is then joined with " xxsep ".

Upvotes: 2

blues

Reputation: 5185

If you don't want to replace every hyphen, but only those that are preceded and followed by certain characters, then use regex lookbehinds and lookaheads.

import re
s = "words-bound-with-hyphen"
re.sub(r'(?<=[\w\d])-(?=[\w\d])', ' xxsep ', s)
# result: 'words xxsep bound xxsep with xxsep hyphen'
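To illustrate where the lookaround version differs from a plain replace, here is a small sketch. The helper names and the sample string are made up for illustration, and `\w` is used on its own because it already includes digits.

```python
import re

def sep_plain(text):
    # Replaces every hyphen, regardless of what surrounds it.
    return text.replace("-", " xxsep ")

def sep_lookaround(text):
    # Replaces a hyphen only when a word character sits directly on both sides.
    # The lookbehind and lookahead are zero-width, so neighbouring words are not
    # consumed and consecutive hyphens such as a-b-c all match.
    return re.sub(r"(?<=\w)-(?=\w)", " xxsep ", text)

sample = "well-known - or -x"
print(sep_plain(sample))       # also rewrites the free-standing and leading hyphens
print(sep_lookaround(sample))  # only rewrites the hyphen inside "well-known"
```

If your tokenizer should treat a dash surrounded by spaces as punctuation rather than as a word joiner, the lookaround version is probably the one you want.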

Upvotes: 1

NPE

Reputation: 500475

You could just replace the hyphens with a regex:

In [4]: re.sub("-", " xxsep ", "word-bound-with-hyphen")
Out[4]: 'word xxsep bound xxsep with xxsep hyphen'

or with string substitution:

In [7]: "word-bound-with-hyphen".replace("-", " xxsep ")
Out[7]: 'word xxsep bound xxsep with xxsep hyphen'

The reason your current approach doesn't work is that re.sub() only replaces non-overlapping matches, whereas word-bound overlaps with bound-with, which in turn overlaps with with-hyphen.
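A quick sketch of what a single pass actually sees, using the string from the question (the variable names are only for illustration, and `\w` alone stands in for `[\w\d]`, since `\w` already covers digits). It also shows why the `while` loop never gets a second pass: `match` only tests at the start of the string.

```python
import re

s = "words-bound-with-hyphen"
pair = re.compile(r"(\w+)-(\w+)")

# Matches found in one left-to-right pass never share characters:
# "bound" is consumed by the first match, so "bound-with" is never seen.
matches = pair.findall(s)

# A single substitution pass therefore leaves the middle hyphen alone.
once = pair.sub(r"\1 xxsep \2", s)

# re.match anchors at the beginning of the string, which now starts with
# "words xxsep", so the loop condition from the question is already false.
loop_continues = re.match(r"\w+-\w+", once) is not None

print(matches)         # [('words', 'bound'), ('with', 'hyphen')]
print(once)            # words xxsep bound-with xxsep hyphen
print(loop_continues)  # False
```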

Upvotes: 2
