Reputation: 1326
I searched for similar questions and tried to formulate a solution on my own. However, I am not satisfied with the result, so I decided to ask here.
Aim:
I want to remove some expressions ("c(\", and \"a\") that appear at the start and end of my strings, using regular expressions and gsub.
#test strings 1 and 2
string1<- "c(\"can't remember the last time\" \"\\a\")"
string2<- "c(\"can't remember the last time\" \"a\")"
#Attempted solution for string1
string1<- gsub("^.\\(","",string1)
string1<- gsub("\\\\.","",string1)
#Result
string1
> "\"can't remember the last time\" \"\")"
Question 1: How can I remove the remaining backslashes without running into the trailing backslash problem? I cannot use [[:punct:]] as that removes other punctuation marks too.
#Attempted solution for string2
string2<- gsub("^.\\(","",string2)
string2<- gsub(".\\{1}","",string2)
#Result
string2
> "\"can't remember the last time\" \"a\")"
Question 2: How can I remove the 'a\' expression and the remaining backslashes?
PS. The strings were acquired as a result of exporting data from a Word document's tables to text files using Java, and then importing the text files into R. But I just want to see how regular expressions can be used to clean this mess, instead of finding some issue with the Java program that exported the data.
Thanks.
EDIT: Apologies for not making the question clear. This is how I would like the final sentence to be:
"can't remember the last time"
2nd-EDIT
The story of the strange string: The strings shown above were selected from a corpus, which I built using the tm package, with the DirSource command. The original text was saved in MS Word, in tabular form. I exported it using Java to create text files for each string, and then imported them into R.
The dput, if it helps, is as follows
structure(c("Can't remember the last time",
"\a"), Author = character(0), DateTimeStamp = structure(list(
sec = 40.6046140193939, min = 56L, hour = 13L, mday = 29L,
mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment1.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character"))
"\a"), Author = character(0), DateTimeStamp = structure(list(
sec = 40.7186260223389, min = 56L, hour = 13L, mday = 29L,
mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment99.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character"))
I can see the "c(" and "\a" in the code above.
Upvotes: 1
Views: 1228
Reputation: 57696
If the two substrings at the start and the end are fixed for all strings, you don't need regexes at all. Just use substr:
substr(string2, 4, nchar(string2) - 6)
If the substring at the end is variable, but can only contain backslashes, spaces, double quotes and a, the regex is:
"[\\\\ \"a]*)$"
Thus we can use sub as follows:
sub("[\\\\ \"a]*)$", "", substr(string1, 4, nchar(string1)))
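As a quick check, here is a minimal sketch that applies the two steps above to both test strings from the question. It assumes the strings are exactly as defined there; the helper name strip_wrapper is made up for illustration, and the closing parenthesis is escaped only to keep the pattern portable across regex engines.

```r
string1 <- "c(\"can't remember the last time\" \"\\a\")"
string2 <- "c(\"can't remember the last time\" \"a\")"

# Hypothetical helper (not from the question): drop the leading c(" by
# position, then strip the trailing run of backslashes, spaces, double
# quotes and a's that sits directly before the closing parenthesis.
strip_wrapper <- function(x) {
  body <- substr(x, 4, nchar(x))
  sub("[\\\\ \"a]*\\)$", "", body)
}

strip_wrapper(string1)
# [1] "can't remember the last time"
strip_wrapper(string2)
# [1] "can't remember the last time"
```

One caveat with this character class: because it contains a and the space, a sentence that itself ends in "a" (or in a space) would lose those trailing characters too. If that can happen in your data, a stricter pattern for the tail is safer.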
Upvotes: 3
Reputation: 7659
As @Mark Miller points out, your question is not very clear. But I guess that
library( stringr )
str_replace_all( string1, '\\"', "" )
solves your first problem and then
string2 <- str_replace_all( string2, '\\"a', "" )
string2 <- str_replace_all( string2, '\\"', "" )
str_replace( string2, '\\)', "" )
the second.
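If you would rather stay with gsub, as the question asks, the same cleanup can probably be done in one pass with an alternation of two anchored patterns. This is only a sketch under the assumption that every string starts with c(" and ends with " "a") or " "\a"), as the two test strings do; the variable names wrapper, clean1 and clean2 are mine.

```r
string1 <- "c(\"can't remember the last time\" \"\\a\")"
string2 <- "c(\"can't remember the last time\" \"a\")"

# Alternation of two anchored patterns (shown without R string escaping):
#   ^c\("            the leading  c("
#   " +"\\?a"\)$     the trailing " "a")  or  " "\a")
wrapper <- "^c\\(\"|\" +\"\\\\?a\"\\)$"

clean1 <- gsub(wrapper, "", string1)
clean2 <- gsub(wrapper, "", string2)

# writeLines() shows the actual characters, without the escaping
# backslashes that print() adds in front of double quotes.
writeLines(c(clean1, clean2))
```

Anchoring both halves means that quotes, parentheses or the letter a inside the sentence are left alone, which the step-by-step replacements above do not guarantee.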
Upvotes: 2