Reputation: 24721
I'm working with Django, trying to autogenerate templates for a whole bunch of links, some of which have non-ASCII characters (e.g. é, ç, etc). While putting those characters in filenames seems to work when I'm browsing on my own computer, Django does not like it and refuses to render them. I figured a quick solution to this would be to just regex-replace these characters with underscores or something, but only in the URLs where Django would otherwise have problems.
The string I'm trying to parse - the autogenerated template - looks something like this:
desc = """...blah blah blah <a href="{% url 'myproject:do_thing' arg_name='ñôt-unìcodé' %}">Link Text Ñôt Unìcodé</a> blah blah blah ..."""
So I want to use regex to change ñôt-unìcodé to __t-un_cod_, while leaving Ñôt Unìcodé untouched. Here's what I've tried:
re.findall(r"arg_name='(([^'])+?)'", desc)
I intend for this to give me a parseable list of all the individual characters, which could then be replaced on an individual basis via re.sub:
['ñ', 'ô', 't', '-', 'u', 'n', 'ì', 'c', 'o', 'd', 'é', ...]
But instead I end up with the entire string and just the last letter:
[('ñôt-unìcodé', 'é'), ...]
What am I misunderstanding here?
(I've found both parts of this question individually on stackoverflow, in different languages, but not at the same time - I'm having trouble combining those answers, though.)
Upvotes: 0
Views: 358
Reputation: 5308
You are applying + to the capturing group itself: ([^'])+
A repeated capturing group does not keep every iteration. Each repetition overwrites the previous one, so you only get the last occurrence the group matched.
So in [('ñôt-unìcodé', 'é'), ...], the first element is the outer group (the whole value), and the second is the inner, repeated group, which holds only the last letter.
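A minimal sketch of that behaviour, using a made-up ASCII string (`sample` is just an illustration, not the asker's template) so the two patterns can be compared side by side:

```python
import re

# Hypothetical input, shortened for illustration.
sample = "arg_name='abc'"

# Quantifier outside the inner group: the group is overwritten on every
# repetition, so only the final character is left in it.
repeated = re.findall(r"arg_name='(([^'])+?)'", sample)
print(repeated)

# Quantifier inside the group: the group spans the whole value.
whole = re.findall(r"arg_name='([^']+)'", sample)
print(whole)

# If a list of individual characters is really needed, split the
# captured value afterwards instead of asking the regex for it.
chars = list(whole[0])
print(chars)
```

Here `repeated` should come out as a list with one (outer group, inner group) tuple, while `whole` holds just the full value.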
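Instead, put the quantifier inside the group: arg_name='([^']+)'
or drop the group entirely: arg_name='[^']+' (in that case re.findall returns the whole match, including the arg_name= prefix).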
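For the replacement itself, one possible approach (a sketch, not the only way) is to pass a function to re.sub, so that only the text captured inside arg_name='...' is rewritten. The helper name `underscore_non_ascii` and the shortened `desc` are assumptions made for this example:

```python
import re

# Shortened version of the template string from the question.
desc = """... <a href="{% url 'myproject:do_thing' arg_name='ñôt-unìcodé' %}">Link Text Ñôt Unìcodé</a> ..."""

def underscore_non_ascii(match):
    # Hypothetical helper: group 2 is the value between the quotes.
    # Every character outside the ASCII range becomes an underscore.
    cleaned = re.sub(r"[^\x00-\x7F]", "_", match.group(2))
    # Put the untouched prefix and closing quote back around it.
    return match.group(1) + cleaned + match.group(3)

# Only the arg_name value is passed through the helper; the link text
# is outside the match and is left alone.
cleaned_desc = re.sub(r"(arg_name=')([^']+)(')", underscore_non_ascii, desc)
print(cleaned_desc)
```

With the sample above, the value in the URL tag becomes __t-un_cod_ and the link text Ñôt Unìcodé stays as it was.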
Upvotes: 1