Regex - replace non-unicode characters but only in certain patterns

Question

I'm working with Django, trying to autogenerate templates for a whole bunch of links, some of which contain non-ASCII characters (e.g. é, ç). Putting those characters in filenames seems to work when I'm browsing on my own computer, but Django does not like it and refuses to render them. I figured a quick fix would be to regex-replace these characters with underscores or something similar, but only in the URLs where Django would otherwise have problems.

The string I'm trying to parse - the autogenerated template - looks something like this:

desc = """...blah blah blah Link Text Ñôt Unìcodé blah blah blah ..."""

So I want to use regex to change ñôt-unìcodé to __t-un_cod_, while leaving Ñôt Unìcodé untouched. Here's what I've tried:

re.findall(r"'arg_name='(([^'])+?)'", desc)

I intend for this to give me a parseable list of all the individual characters, which could then be replaced on an individual basis via re.sub:

['ñ', 'ô', 't', '-', 'u', 'n', 'ì', 'c', 'o', 'd', 'é', ...]

But instead I end up with the entire string and just the last letter:

[('ñôt-unìcodé', 'é'), ...]

What am I misunderstanding here?

(I've found both parts of this question individually on Stack Overflow, in different languages, but never together, and I'm having trouble combining those answers.)

Julio · Accepted Answer

You are adding + to the capturing group: ([^'])+

A repeated capturing group does not collect every iteration. It is overwritten on each repetition, so you only get the last occurrence.

So in [('ñôt-unìcodé', 'é'), ...], the first element is the outer capturing group (the whole value between the quotes), and the second is the inner capturing group (just the last letter).
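A minimal sketch of that behaviour, for anyone who wants to see it in isolation. The sample string is hypothetical (the real template is elided in the question), and the pattern drops the stray leading quote from the original attempt.

```python
import re

# Hypothetical sample text; the actual template string is not shown in full.
desc = "blah arg_name='ñôt-unìcodé' blah"

# Quantifier applied to the inner group: that group is overwritten on every
# repetition, so only the last character it matched is kept.
repeated = re.findall(r"arg_name='(([^'])+?)'", desc)

# Quantifier inside the group: a single group holding the whole value.
single = re.findall(r"arg_name='([^']+)'", desc)

print(repeated)  # one tuple: (whole value, last character)
print(single)    # one string: the whole value
```

With more than one group in the pattern, re.findall returns a tuple per match, which is why the first result is a list of tuples rather than a list of strings.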

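Try arg_name='([^']+)' instead, or even arg_name='[^']+' if you don't need a capturing group (re.findall then returns the whole match, including the arg_name= prefix).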
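Once the value is captured as a whole, one possible way to do the replacement itself is to pass a function to re.sub, so only the text inside arg_name='...' is touched. This is a rough sketch, not a tested Django solution: the sample string and the helper name underscore_non_ascii are made up for illustration, and "non-ASCII" is assumed to mean anything outside \x00-\x7F.

```python
import re

# Hypothetical sample; the real template string is elided in the question.
desc = "blah arg_name='ñôt-unìcodé' blah Link Text Ñôt Unìcodé blah"

# Three groups: the prefix, the value to clean, and the closing quote.
ARG_PATTERN = re.compile(r"(arg_name=')([^']+)(')")

def underscore_non_ascii(match):
    """Replace each non-ASCII character in the captured value with '_'."""
    cleaned = re.sub(r"[^\x00-\x7F]", "_", match.group(2))
    return match.group(1) + cleaned + match.group(3)

# re.sub calls the function once per match; text outside matches is unchanged.
result = ARG_PATTERN.sub(underscore_non_ascii, desc)
print(result)
```

The link text is left alone because it never falls inside a match. Note this assumes the accented letters are single precomposed code points; with decomposed input, the base letter would survive and only the combining mark would become an underscore.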
