Jacopo Terrinoni

Reputation: 173

Regular expression does not match Unicode characters

I am trying to run a regular expression over text that contains special characters such as à, è, ù, etc.

filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)

The text below is the result of the evaluation of the expression.

[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]],[[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. 2. edition. Vol. 1: Briefe der Jahre 1764–1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]

Now, when I run another regular expression over the above text to extract the names inside the double square brackets, the result is wrong. Every name that contains a special character, such as à, ù or è, is cut off at that character, so the result is not the one I expect.

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.U | re.M)
another_filtered_list = re.findall(another_compiled, (str(filter_list)))

These are my results:

[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('G.W.F. Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ivan Turgenev', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]

These are the results I would like to achieve:

MATCH 1 1. [2-28] Pedro Calderón de la Barca
MATCH 2 1. [43-72] Christian Fürchtegott Gellert
MATCH 3 1. [86-102] Oliver Goldsmith
MATCH 4 1. [118-123] Hafez
MATCH 5 1. [129-152] Johann Gottfried Herder
MATCH 6 1. [165-170] Homer
MATCH 7 1. [176-184] Kālidāsa
MATCH 8 1. [190-194] Kant
MATCH 9 1. [200-228] Friedrich Gottlieb Klopstock
MATCH 10 1. [244-268] Gotthold Ephraim Lessing
MATCH 11 1. [282-295] Carl Linnaeus
MATCH 12 1. [310-326] James Macpherson
MATCH 13 1. [343-364] Jean-Jacques Rousseau
MATCH 14 1. [379-397] Friedrich Schiller
MATCH 15 1. [412-431] William Shakespeare
MATCH 16 1. [449-456] Spinoza
MATCH 17 1. [462-480] Emanuel Swedenborg
MATCH 18 1. [501-522] Karl Robert Mandelkow
MATCH 19 1. [659-685] Johann Joachim Winckelmann

All the regular expressions were tested online and they work perfectly there. Is there a way to actually include these special characters?

Upvotes: 1

Views: 174

Answers (1)

Karin

Reputation: 8610

In Python 3, the pattern doesn't compile because the ur'...' string prefix is not valid syntax there. It seemed to work for me when I changed:

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'

to just a unicode (not raw) string:

filter_6 = u'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'

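In Python 2, I believe the issue is the casting of the list to a string. str(filter_list) produces the repr of the list, in which non-ASCII characters show up as escape sequences such as \xf3; the backslash is not matched by [\w\s.-], so each name is cut off at that point. Changing str(filter_list) to ' '.join(filter_list) seemed to work for me.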
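To make the difference concrete, here is a minimal sketch for Python 3. It is not the asker's exact setup: the sample list is a shortened, hypothetical stand-in for filter_list, the pattern is written as a raw string (my choice, so the backslashes reach the regex engine unchanged), and ascii() is used only to imitate the escaped text that str() on a list of unicode strings produces in Python 2.

```python
import re

# Hypothetical, shortened stand-in for the question's `filter_list`.
filter_list = ["[[Pedro Calderón de la Barca|Calderón]], [[Hafez]], [[Kālidāsa]]"]

# The question's pattern, written as a raw string (an assumption made here
# so the backslashes reach the regex engine unchanged in Python 3).
pattern = re.compile(r"(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))", flags=re.U | re.M)

# Joining the list keeps the accented characters as real characters.
joined = " ".join(filter_list)
names = [m[0] for m in pattern.findall(joined)]

# ascii() turns non-ASCII characters into backslash escapes, similar to
# what str() on a list of unicode strings does in Python 2.
escaped = ascii(filter_list)
truncated = [m[0] for m in pattern.findall(escaped)]

print(names)
print(truncated)
```

With the joined text, names holds the full 'Pedro Calderón de la Barca', 'Hafez' and 'Kālidāsa'. With the escaped text, truncated holds 'Pedro Calder', 'Hafez' and 'K', which is the same kind of truncation shown in the question.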

Upvotes: 2
