info_seekeR

Reputation: 1326

Using regular expressions to remove (complicated?) string patterns

I looked for similar questions and tried to work out a solution on my own, but I am not satisfied with the result, so I decided to ask here.

Aim: I want to remove some expressions ("c(\", and \"a\") that appear at the start and end of my strings, using regular expressions and gsub.

#test strings 1 and 2
string1<- "c(\"can't remember the last time\" \"\\a\")"
string2<- "c(\"can't remember the last time\" \"a\")"

#Attempted solution for string1
string1<- gsub("^.\\(","",string1)
string1<- gsub("\\\\.","",string1)

#Result
string1
> "\"can't remember the last time\" \"\")"

Question 1: How can I remove the remaining backslashes without running into the trailing backslash problem? I cannot use [[:punct:]] because that removes other punctuation marks too.

#Attempted solution for string2
string2<- gsub("^.\\(","",string2)
string2<- gsub(".\\{1}","",string2)

#Result
string2
> "\"can't remember the last time\" \"a\")"

Question 2: How can I remove the 'a\' expression and the remaining backslashes?

PS. The strings were acquired as a result of exporting data from a Word document's tables to text files using Java, and then importing the text files into R. But I just want to see how regular expressions can be used to clean this mess, instead of finding some issue with the Java program that exported the data.

Thanks.

EDIT: Apologies for not making the question clear. This is how I would like the final sentence to be:

"can't remember the last time"

2nd-EDIT

The story of the strange string: The strings shown above were selected from a corpus, which I built using the tm package, with the DirSource command. The original text was saved in MS Word, in tabular form. I exported it using Java to create a text file for each string, and then imported them into R. The dput, if it helps, is as follows:

structure(c("Can't remember the last time", 
"\a"), Author = character(0), DateTimeStamp = structure(list(
    sec = 40.6046140193939, min = 56L, hour = 13L, mday = 29L, 
    mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment1.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")) 
"\a"), Author = character(0), DateTimeStamp = structure(list(
    sec = 40.7186260223389, min = 56L, hour = 13L, mday = 29L, 
    mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment99.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character"))

I can see the "c(" and "\a" in the code above.

Upvotes: 1

Views: 1228

Answers (2)

Hong Ooi

Reputation: 57696

If the two substrings at the start and the end are fixed for all strings, you don't need regexes at all. Just use substr:

substr(string2, 4, nchar(string2) - 6)
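A quick check of where those offsets come from, as a minimal sketch. It assumes the prefix is always the three characters c(" and the suffix is always the six characters " "a"); the names prefix, suffix and cleaned are only for illustration.

```r
string2 <- "c(\"can't remember the last time\" \"a\")"

prefix <- "c(\""        # 3 characters: c ( "
suffix <- "\" \"a\")"   # 6 characters: " space " a " )

# Start right after the prefix and stop right before the suffix.
cleaned <- substr(string2, nchar(prefix) + 1, nchar(string2) - nchar(suffix))
print(cleaned)
```

For string1 the suffix is one character longer (it contains a real backslash), so the same fixed offsets would leave a trailing double quote there.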

If the substring at the end is variable, but can only contain backslashes, double quotes and a, the regex is:

"[\\\\ \"a]*)$"

Thus we can use sub as follows:

sub("[\\\\ \"a]*)$", "", substr(string1, 4, nchar(string1)))
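To see that one call handles both test strings, the two steps can be wrapped in a small helper. This is only a sketch: strip_wrapper is a hypothetical name, and the closing parenthesis is escaped here (unlike in the pattern above) so that the same pattern should also work with perl = TRUE.

```r
string1 <- "c(\"can't remember the last time\" \"\\a\")"
string2 <- "c(\"can't remember the last time\" \"a\")"

# Hypothetical helper: drop the leading c(" and then the trailing run of
# backslashes, spaces, double quotes and a's that ends in ")".
strip_wrapper <- function(x) {
  body <- substr(x, 4, nchar(x))
  sub("[\\\\ \"a]*\\)$", "", body)
}

print(strip_wrapper(string1))
print(strip_wrapper(string2))
```

Because the character class includes the letter a, a text that itself ends in "a" would lose that final letter, so this only suits data where that cannot happen.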

Upvotes: 3

vaettchen

Reputation: 7659

As @Mark Miller points out, your question is not very clear. But I guess that

library( stringr )
str_replace_all( string1, '\\"', "" )

solves your first problem and then

string2 <- str_replace_all( string2, '\\"a', "" )
string2 <- str_replace_all( string2, '\\"', "" )
str_replace( string2, '\\)', "" )

the second.
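It may also help to check which backslashes are really part of the data and which are only added by print() to escape the double quotes. A small base R sketch using the two test strings from the question (has_backslash is just an illustrative name):

```r
string1 <- "c(\"can't remember the last time\" \"\\a\")"
string2 <- "c(\"can't remember the last time\" \"a\")"

# print() shows the escaped form; writeLines() shows the raw characters.
print(string2)
writeLines(string2)

# Search for a literal backslash in each string (fixed = TRUE disables regex).
has_backslash <- grepl("\\", c(string1, string2), fixed = TRUE)
print(has_backslash)
```

Under this check only string1 contains an actual backslash; the ones shown in front of the quotes in the printed results are not characters in the string.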

Upvotes: 2
