Reputation: 91
I have a list in which each element is a character vector. In essence, I want to delete all text that follows the second "." in each string.
I believe gsub combined with a regular expression is a good way to do this. My attempt at the pattern is shown below.
Data:
v<-c("M. le président. La parole est à M. Emile Vernaudon.",
"M.Gabriel Xaaperei. Monsieur le ministre",
"M. Raymond Fornir, rapporteur. La commission")
Code:
Subbed<-gsub("[^((?<=^M. *))]", "X", v)
The code returns the following:
[1] "M. XX XXXXXXXXX. XX XXXXXX XXX. M. XXXXX XXXXXXXXX."
[2] "M. XXXXXXX XXXXXXXXX. MXXXXXXX XX XXXXXXXXX XXX"
[3] "M. XXXXXXX XXXXXX XXXXXXXXXX. XX XXXXXXXXXX"
Not only does the code keep every "M." rather than only the first one, but an "M" also survives in the second row even though it is not followed by a ".". My hunch is that gsub reads the regular expression differently than I expect: the "M." in my pattern might be read by R as "M|.". Also, the ^ in the lookaround doesn't seem to work as an anchor but simply as another literal character.
The desired outcome is as follows:
[1] "M. le président."
[2] "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
Any help much appreciated.
Upvotes: 1
Views: 225
Reputation: 269955
1) sub Match the beginning of the string (^) and then capture M. (with the dot escaped). Next, match any spaces and then capture everything up to and including the next dot. Finally, match everything else. Replace the whole match with the first capture (\1), a space and the second capture (\2).
Note that we use sub rather than gsub since there is just one overall match per component. Also, this puts a space after the M. even if there was not one already.
sub("^(M\\.) *([^.]+\\.).*", "\\1 \\2", v)
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
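To see what each capture group picks up, here is a minimal sketch that inspects the groups with regmatches() and regexec() before doing the replacement. It uses only the two ASCII-only strings from the question to stay portable. The names v_demo, pattern, parts and res_sub are made up for this illustration.

```r
# Two of the question's strings (the ASCII-only ones)
v_demo <- c("M.Gabriel Xaaperei. Monsieur le ministre",
            "M. Raymond Fornir, rapporteur. La commission")

# Same pattern as in solution 1
pattern <- "^(M\\.) *([^.]+\\.).*"

# regexec() reports the full match followed by each capture group
parts <- regmatches(v_demo, regexec(pattern, v_demo))
parts[[1]][2]  # first capture:  "M."
parts[[1]][3]  # second capture: "Gabriel Xaaperei."

# The replacement keeps only the two captures, separated by one space
res_sub <- sub(pattern, "\\1 \\2", v_demo)
res_sub
```

Because the pattern is anchored with ^ and ends in .*, it consumes the whole string, so there is at most one match per element and sub is sufficient.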
2) read.table This solution does not use any regular expressions. We read in v using dot-separated fields and then assemble them back together using sprintf.
with(read.table(text = v, sep = ".", fill = TRUE, strip.white = TRUE),
sprintf("%s. %s.", V1, V2))
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
3) paste/trimws/sub This uses several functions and only one regex which is relatively simple. We take everything from the 3rd character onwards, replace the first dot and everything after it with a dot, trim whitespace in case any is left and paste M. onto the beginning.
paste("M.", trimws(sub("\\..*", ".", substring(v, 3))))
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
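The same pipeline can be unrolled step by step to show the intermediate values. This is only a sketch, and it assumes every string starts with exactly the two characters "M." (otherwise substring(v, 3) would cut in the wrong place). The names v_steps, step1 to step3 and res_paste are made up for this illustration.

```r
# Two of the question's strings (the ASCII-only ones)
v_steps <- c("M.Gabriel Xaaperei. Monsieur le ministre",
             "M. Raymond Fornir, rapporteur. La commission")

# Step 1: drop the leading "M." by keeping characters 3 onwards
step1 <- substring(v_steps, 3)

# Step 2: replace the first remaining dot and everything after it with a dot
step2 <- sub("\\..*", ".", step1)

# Step 3: remove the leading space left over when "M." was followed by a space
step3 <- trimws(step2)

# Step 4: put "M." back in front; paste() inserts a single space
res_paste <- paste("M.", step3)
res_paste
```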
Upvotes: 3
Reputation: 2867
gsub("^([^.]*.[^.]*).*", "\\1.", v)
[1] "M. le président." "M.Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
Upvotes: 2
Reputation: 7592
You placed your regular expression within square brackets, which R interprets as a character class: it matches any single one of the listed characters, so everything inside is effectively joined by "OR". You also started the class with ^, which negates it, so the pattern matches any character that is not one of those in your search term. Furthermore, you didn't escape your periods. Here's the regex as it should be:
gsub("^(M\\..*?\\.).*","\\1",v)
[1] "M. le président." "M.Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
This looks for M. (the period is escaped), followed by any character (the unescaped .) repeated any number of times (*), followed by a second (escaped) period. The ? makes the * ungreedy, so the match stops at the next period rather than the last one.
It then returns everything up to there (\\1) and discards the rest.
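Both points can be checked on short toy strings. The sketch below is only an illustration: the inputs x and y and the simplified class [^M.] are made up for it and are not the exact pattern from the question.

```r
# A negated character class matches any single character NOT listed in it.
# Every "M" and every "." is kept, whether or not they occur together,
# which is why a stray "M" survives in the question's output.
x <- "M.Ab. M c"
masked <- gsub("[^M.]", "X", x)
masked  # "M.XX.XMXX"

# Greedy vs. ungreedy quantifier on a string with several periods
y <- "M. A. B. C"
greedy <- sub("^(M\\..*\\.).*", "\\1", y)   # .* runs to the LAST period
lazy   <- sub("^(M\\..*?\\.).*", "\\1", y)  # .*? stops at the NEXT period
greedy  # "M. A. B."
lazy    # "M. A."
```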
Upvotes: 1