Reputation: 13
I am new to text mining and am currently stuck on this kind of pattern:
pattern = c(
"<f0><U+009F><U+0098><U+00AD>",
"<f0><U+009F><U+0099><U+008F>",
"<f0><U+009F><U+008F><U+00BF> ",
"<f0><U+009F><U+0098><U+0082>",
" <f0><U+009F><U+00A4><U+00B7>",
" <f0><U+009F><U+008F><U+00BD><U+200D><U+2640><U+FE0F>\r\nBody",
" <f0><U+009F><U+00A4><U+00A3>",
" <f0><U+009F><U+0099><U+0084> ",
" <f0><U+009F><U+0099><U+0084>",
" <f0><U+009F><U+0099><U+0083>",
"<f0><U+009F><U+0098><U+00B4>",
"Hello")
I would like to keep only "Hello" and exclude all the other text.
I tried the following but I failed immediately:
gsub(c, "<f0><U+00F><U+[0-9]><U+[a-zA-Z0-9]>*, replacement = "")
So, I tried to break it down:
a = gsub(c, pattern = "<f0>", replacement = "")
-> result: <f0>
is dropped, which is a good sign, but when I do the next step
gsub(a, pattern = "<U+009F>", replacement = "")
->result: <U+009F>
remains.
Do you have any ideas?
I would appreciate any suggestions!
Thanks in advance!
Upvotes: 1
Views: 383
Reputation: 233
Here are two ways to clean your text. No criterion was given for removing "Body", so both approaches keep it.
x <- pattern # to avoid ambiguity in function parameters
# by finding words of at least two letters (so not 'a' or 'I' either)
words <- unlist(regmatches(x, gregexpr("\\b[[:alpha:]]{2,}\\b", x, perl=TRUE)))
words
#[1] "Body" "Hello"
# by removing unwanted characters and character sequences
cleaned <- gsub("(<[^>]{0,}>|\\r|\\n)", "", x, perl=TRUE)
# and removing leading and trailing spaces
cleaned <- gsub("^ {1,}| {1,}$", "", cleaned, perl=TRUE)
cleaned[cleaned != ""]
#[1] "Body" "Hello"
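As for why the second step in the question left <U+009F> in place: a likely cause is that + is a regex quantifier, so "<U+009F>" is read as "<", one or more "U", then "009F>", and the literal plus sign never matches. The sketch below assumes the strings contain these sequences as literal text (not as real Unicode characters); the vector a and the {4,6} digit range are illustrative choices, not something taken from your data.

```r
# Assumption: "<U+009F>" etc. are literal text in the strings.
a <- c("<U+009F><U+0098><U+00AD>", "Hello")

# Unescaped: "U+" means "one or more U", so nothing is replaced.
unescaped <- gsub("<U+009F>", "", a)

# Option 1: escape the plus sign so it is matched literally.
escaped <- gsub("<U\\+009F>", "", a)

# Option 2: treat the whole pattern as a fixed string, not a regex.
fixed <- gsub("<U+009F>", "", a, fixed = TRUE)

# Generalised: drop every "<U+XXXX>" token in one pass.
all_tokens <- gsub("<U\\+[0-9A-Fa-f]{4,6}>", "", a)
```

Either option removes one token at a time; the generalised pattern avoids listing each code point separately.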
Upvotes: 1