R - NLP - text cleaning

Question

I am new to text mining and am currently stuck on this kind of pattern:

pattern = c(
    "", 
    "",
    " ",
    "", 
    " ",
    "  
Body",
    " ", 
    "  ", 
    "  ",
    "  ",
    "",
    "Hello")

I would like to end up with only pattern = "Hello" and drop everything else.

I tried the following, but it failed immediately:

gsub(c, "*, replacement = "")

So, I tried to break it down:

a = gsub(c, pattern = "", replacement = "")

-> result: drops, which is a good sign, but when I do the next step

gsub(a, pattern = "", replacement = "")

-> result: remains. Do you have any ideas? I would appreciate any suggestions. Thanks in advance!

IanRiley · Accepted Answer

Here are two ways to clean your text. The question gives no criterion for removing "Body", so both approaches keep it.

x <- pattern # to avoid ambiguity in function parameters

# by finding words of two or more letters (so not 'a' or 'I')
# note: backslashes must be doubled in R strings; "\b" alone is a backspace character
words <- unlist(regmatches(x, gregexpr("\\b[[:alpha:]]{2,}\\b", x, perl=TRUE)))
words

#[1] "Body"  "Hello"

# by removing unwanted characters and character sequences
cleaned <- gsub("(<[^>]{0,}>|\r|\n)", "", x, perl=TRUE)
# and removing leading and trailing spaces
cleaned <- gsub("^ {1,}| {1,}$", "", cleaned, perl=TRUE)
cleaned[cleaned != ""]

#[1] "Body"  "Hello"
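
If "Body" should also go, some explicit criterion is needed. The following is only a sketch under an assumption the question does not state: that the unwanted leftovers can be listed by name. The input vector and the `unwanted` list are hypothetical stand-ins, not the asker's actual data.

```r
# Hypothetical input resembling the question's vector (tags are illustrative)
x <- c("<html>", "", " ", "  \nBody", "<p>Hello</p>")

# Strip tags and line breaks, then trim surrounding whitespace
cleaned <- trimws(gsub("<[^>]*>|[\r\n]", "", x))

# Drop elements that are empty after cleaning
cleaned <- cleaned[nzchar(cleaned)]

# Assumed criterion (not given in the question): a list of tokens to discard
unwanted <- c("Body")
result <- cleaned[!cleaned %in% unwanted]
result

#[1] "Hello"
```

If the rule were instead "keep only the last non-empty element", `tail(cleaned, 1)` would give the same result here, but which rule is right depends on the real data.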
