Reputation: 28505
In R I have a column that should contain only one word. It is created by taking the contents of another column and keeping only the last word with a regex. However, for some rows this doesn't work, in which case R simply copies the content of the first column. Here is my R code:
df$precedingWord <- gsub(".*?\\W*(\\w+-?)\\W*$","\\1", df$leftContext, perl=TRUE)
precedingWord should hold only one word, extracted from leftContext with a regex. This works fine overall, but not with diacritics. A couple of rows in leftContext have letters with diacritics such as é and à. For some reason R ignores these items completely and simply copies the whole thing to precedingWord. I find this odd, because it is practically impossible for the regex to match the whole thing, as you can see here. In the example, Test string is leftContext and Substitution should be precedingWord.
As you can see in the example above, the output in the online regex tester differs from the output I get: I simply get an exact copy of leftContext. This does not mean that the online tester's output is what I want. The tool treats letters with diacritics as non-word characters and therefore doesn't mark them as the output I want. I actually want to treat them as word characters so that they are eligible for output.
If this is the input:
Un premier projet prévoit que l'établissement verserait 11 FF par an et par élève du secondaire et 30 FF par étudiant universitaire, une somme à évaluer et à
Outre le prêt-à-
And à
Sur base de ces données, on cherchera à
Ce sera encore le cas ce vendredi 19 juillet dans l'é
Then this is the output I expect
à
prêt-à-
à
à
é
This is the regex I already have
.*?\W*(\w+?-?)\W*$
I'm already using stringi in my project, so if that provides a solution I could use that.
Upvotes: 3
Views: 326
Reputation: 627082
In Perl-like regex, you can match any Unicode letter with the \p{L} shorthand class, and any character that is not a Unicode letter with the reverse class \P{L}. See regular-expressions.info:
You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}.
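As a quick illustration of the two classes, here is a minimal sketch in base R. The vector name samples and the two result names are made up for this example, and the non-ASCII letters are written as \u escapes on the assumption that this keeps the strings UTF-8 regardless of the script's encoding.

```r
# \u escapes keep the strings UTF-8 regardless of the script's file encoding.
samples <- c("\u00e9", "pr\u00eat", "-", "19")   # "é", "prêt", "-", "19"

# TRUE where the whole string consists of Unicode letters only.
only_letters <- grepl("^\\p{L}+$", samples, perl = TRUE)

# TRUE where the string contains no Unicode letter at all.
no_letters <- grepl("^\\P{L}+$", samples, perl = TRUE)

print(data.frame(samples, only_letters, no_letters))
```

The accented strings fall into the first group, while the hyphen and the digits fall into the second.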
Thus, the regex you can use is
df$precedingWord <- gsub(".*?\\P{L}*(\\p{L}+-?)\\P{L}*$","\\1", df$leftContext, perl=TRUE)
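To check the pattern against the sample rows from the question, here is a small sketch in base R. The first row is shortened to its tail, and left_context, answer_result, last_word and variant_result are names invented for this example. One caveat that shows up when tracing the pattern: for the hyphenated row it appears to keep only the last segment (à-) rather than the whole compound (prêt-à-) listed in the expected output. The last_word helper below is a hypothetical variant, not part of the regex above, that treats hyphens as part of the word and takes the last such run.

```r
# Sample rows from the question; non-ASCII letters are written as \u escapes
# so the strings are UTF-8 regardless of the script's file encoding.
left_context <- c(
  "une somme \u00e0 \u00e9valuer et \u00e0",   # tail of the first (long) row
  "Outre le pr\u00eat-\u00e0-",
  "And \u00e0",
  "Sur base de ces donn\u00e9es, on cherchera \u00e0",
  "Ce sera encore le cas ce vendredi 19 juillet dans l'\u00e9"
)

# The regex from this answer, applied unchanged.
answer_result <- gsub(".*?\\P{L}*(\\p{L}+-?)\\P{L}*$", "\\1",
                      left_context, perl = TRUE)

# Hypothetical variant: find every run that starts with a letter and continues
# with letters or hyphens, then keep the last run of each string.
last_word <- function(x) {
  hits <- regmatches(x, gregexpr("\\p{L}[\\p{L}-]*", x, perl = TRUE))
  vapply(hits,
         function(w) if (length(w)) w[length(w)] else NA_character_,
         character(1))
}
variant_result <- last_word(left_context)

print(answer_result)
print(variant_result)
```

Since stringi is already in use in the project, stri_extract_last_regex(left_context, "\\p{L}[\\p{L}-]*") should be an equivalent way to write the variant, though that call is not tested here.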
Upvotes: 1