Reputation: 173
I am trying to use a regular expression on text that contains special characters such as à, è, ù, etc.
filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)
The text below is the result of the evaluation of the expression.
[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]],[[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. 2. edition. Vol. 1: Briefe der Jahre 1764–1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]
Now, when I apply another regular expression to the text above in order to extract the names inside the double square brackets, the result is wrong. Every name that contains a special character, such as à, ù or è, is cut off at that character, so the result is not what I expect.
filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.U | re.M)
another_filtered_list = re.findall(another_compiled, (str(filter_list)))
These are my results:
[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('G.W.F. Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ivan Turgenev', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]
These are the results I would like to achieve:
MATCH 1 1. [2-28]Pedro Calderón de la Barca
MATCH 2 1. [43-72]Christian Fürchtegott Gellert
MATCH 3 1. [86-102]Oliver Goldsmith
MATCH 4 1. [118-123]Hafez
MATCH 5 1. [129-152]Johann Gottfried Herder
MATCH 6 1. [165-170]Homer
MATCH 7 1. [176-184]Kālidāsa
MATCH 8 1. [190-194]Kant
MATCH 9 1. [200-228]Friedrich Gottlieb Klopstock
MATCH 10 1. [244-268]Gotthold Ephraim Lessing
MATCH 11 1. [282-295]Carl Linnaeus
MATCH 12 1. [310-326]James Macpherson
MATCH 13 1. [343-364]Jean-Jacques Rousseau
MATCH 14 1. [379-397]Friedrich Schiller
MATCH 15 1. [412-431]William Shakespeare
MATCH 16 1. [449-456]Spinoza
MATCH 17 1. [462-480]Emanuel Swedenborg
MATCH 18 1. [501-522]Karl Robert Mandelkow
MATCH 19 1. [659-685]Johann Joachim Winckelmann
All the regular expressions were tested online, and they work perfectly there. Is there a way to actually include these special characters?
Upvotes: 1
Views: 174
Reputation: 8610
In Python 3, the regex doesn't compile, because the ur'' string prefix is no longer valid syntax. This seemed to work for me when I changed:
filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
to just a unicode (not raw) string:
filter_6 = u'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
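As a side note, in Python 3 every str is already Unicode, so a plain raw string (r'...') should work as well and avoids the invalid-escape warnings that sequences like \[ and \w can trigger in non-raw literals. Here is a minimal sketch of that variant; the pattern is the one from the question with only the prefix changed, and the sample text is a shortened piece of the question's data, picked by me for illustration:

```python
import re

# Question's pattern with only the string prefix changed to r''.
# Assumption: Python 3, where str patterns are Unicode-aware by default,
# so re.U is no longer needed.
filter_6 = r'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.M)

# Shortened sample taken from the text in the question.
sample = "[[Christian Fürchtegott Gellert|Gellert]], [[Homer]]"
matches = another_compiled.findall(sample)

# The pattern has two groups, so findall returns (name, '') tuples.
names = [name for name, _ in matches]
print(names)
```

The second group only ever captures an empty string, which is why the question's output consists of (name, '') tuples.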
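In Python 2, I believe the issue is the casting of the list to a string: str() on a list returns the repr of each element, so non-ASCII characters turn into escape sequences such as \xf3, and [\w\s.-]+ stops matching at the backslash. Changing str(filter_list)
to ' '.join(filter_list)
seemed to work for me.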
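To illustrate the difference between str() and join, here is a small sketch for Python 3. It relies on two assumptions of mine that are not part of the original code: ascii() stands in for what str() did to a list of unicode strings in Python 2 (it escapes non-ASCII characters), and the pattern is a simplified one without the lookarounds.

```python
import re

# Shortened sample in the shape of filter_list from the question.
filter_list = [
    "[[Pedro Calderón de la Barca|Calderón]], [[Hafez]], [[Kālidāsa]], [[Kant]]"
]

# Simplified pattern (assumption): capture the run of word characters,
# whitespace, dots and hyphens that follows "[[".
link_target = re.compile(r"\[\[([\w\s.-]+)")

# ascii() escapes non-ASCII characters (e.g. \xf3), roughly what
# str(filter_list) produced in Python 2 for a list of unicode strings.
escaped = ascii(filter_list)
truncated = link_target.findall(escaped)

# Joining the elements keeps the real characters in the text.
joined = " ".join(filter_list)
names = link_target.findall(joined)

print(truncated)  # ['Pedro Calder', 'Hafez', 'K', 'Kant']
print(names)      # ['Pedro Calderón de la Barca', 'Hafez', 'Kālidāsa', 'Kant']
```

The truncated names match the pattern seen in the question's output: the match ends at the backslash of the escape sequence, not at the accented letter itself.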
Upvotes: 2