user3710832

Reputation: 415

How do I separate words in a given text in R?

For example I have a text file with content as follows:

I wantto separate those wordswhich arejoined.

How do I separate the words in this text so that I get the following output?

I want to separate those words which are joined.

Basically, I need something that can detect meaningless words in the text and make them meaningful.

For example, the code should detect that "wantto" does not make any sense and after processing it, it should be able to return "want to" as output.

It may return some other meaningful combination of words but that is fine.

Upvotes: 2

Views: 2121

Answers (2)

arvind

Reputation: 96

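Here is some quick-and-dirty code that should help you split at least the cases where two words are joined, without using aspell. The dictionary I used is big.txt from Peter Norvig's site, which should be enough for common words. For each word that is not in the dictionary, split_matches takes the longest prefix that is a known word and returns the remainder as the second word (the remainder itself is not checked). You can use the correctSentence function to see the results: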

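## big.txt is taken from Peter Norvig's basic spell checker data file
## quote = "" stops scan() from treating apostrophes and quotation marks as string delimiters
words <- scan("http://norvig.com/big.txt", what = character(), quote = "")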

## Split a joined word at its longest prefix that is in the dictionary
split_matches <- function(word) {
  num_char <- nchar(word)
  # try prefixes from longest to shortest
  for (i in seq_len(num_char)) {
    prefix <- substr(word, 1, num_char - i + 1)
    if (prefix %in% words) {
      rest <- substr(word, nchar(prefix) + 1, num_char)
      # drop the empty remainder when the whole word matched
      return(c(prefix, rest[nzchar(rest)]))
    }
  }
  # no prefix is a known word: return the input unchanged
  word
}

correctSentence <- function(sentence) {
  list_of_words <- strsplit(sentence, " ")[[1]]

  output_str <- character()
  for (word in list_of_words) {
    if (word %in% words) {
      # known word: keep it as it is
      output_str <- c(output_str, word)
    } else {
      # unknown word: try to split it into two parts
      output_str <- c(output_str, split_matches(word))
    }
  }
  paste(output_str, collapse = " ")
}
# test this with your sentence
correctSentence("I wantto separate those wordswhich arejoined")
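
The code above splits each unknown word once and does not check the remainder. If more than two words may be joined, one possible extension is a small dynamic-programming segmentation that only accepts a split when every piece is a dictionary word. This is only a sketch under assumptions: segment_word, segment_sentence and the tiny dict vector are hypothetical names introduced here for illustration, the lookup is case-insensitive, and "fewest pieces" is just one reasonable tie-breaking rule.

```r
# Hypothetical miniature dictionary (assumption); swap in a real word list in practice
dict <- c("i", "want", "to", "separate", "those", "words", "which",
          "are", "joined", "a", "an")

# Split `word` into the fewest dictionary words, or return it unchanged
# when no complete segmentation exists.
segment_word <- function(word, dict) {
  n <- nchar(word)
  if (n == 0) return(word)
  lw <- tolower(word)
  # best[k + 1]: fewest pieces covering the first k characters (Inf if impossible)
  best <- c(0, rep(Inf, n))
  # prev[k + 1]: start offset of the last piece in that best split
  prev <- rep(NA_integer_, n + 1)
  for (end in seq_len(n)) {
    for (start in 0:(end - 1)) {
      if (is.finite(best[start + 1]) &&
          substr(lw, start + 1, end) %in% dict &&
          best[start + 1] + 1 < best[end + 1]) {
        best[end + 1] <- best[start + 1] + 1
        prev[end + 1] <- start
      }
    }
  }
  if (!is.finite(best[n + 1])) return(word)
  # walk back through prev to recover the pieces in order
  pieces <- character()
  end <- n
  while (end > 0) {
    start <- prev[end + 1]
    pieces <- c(substr(word, start + 1, end), pieces)
    end <- start
  }
  pieces
}

# Apply segment_word to every space-separated token of a sentence
segment_sentence <- function(sentence, dict) {
  tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  paste(unlist(lapply(tokens, segment_word, dict = dict)), collapse = " ")
}

segment_sentence("I wantto separate those wordswhich arejoined", dict)
```

With a large dictionary such as big.txt, many short strings count as words, so it may be worth lowercasing and de-duplicating the word list first and preferring splits made of frequent words.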

Upvotes: 2

gagolews

Reputation: 13046

If you have aspell installed (see ?aspell), this may give you a hint:

> writeLines("I wantto separate those wordswhich arejoined.", "/tmp/test.txt")
> sp <- aspell('/tmp/test.txt')
> sp
arejoined
  /tmp/test.txt:1:36

wantto
  /tmp/test.txt:1:3

wordswhich
  /tmp/test.txt:1:25
> sp[[5]]
[[1]]
 [1] "want to" "want-to" "want"    "wanton"  "Watt"    "watt"    "wand"    "went"    "wont"    "whatnot" "wants"   "canto"  
[13] "panto"   "Wanda"   "waned"   "won't"   "want's"  "wanted"  "NATO"    "vanity"  "wander"  "winter"  "wart"    "natty"  
[25] "vaunt"   "wan"     "ant"     "walnut"  "wasn't"  "Witt"    "wait"    "wane"    "wino"   

[[2]]
 [1] "words which" "words-which" "wordsmith"   "Wordsworth"  "words"       "Woodstock"   "word's"      "woodsier"   
 [9] "Woods"       "wards"       "woods"       "ward's"      "woad's"      "wood's"      "wort's"     

[[3]]
[1] "are joined" "are-joined" "rejoined"   "adjoined"   "enjoined"   "rejoinder"  "regained"  

Anyway, such a task will always be dictionary-based.
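
To turn such suggestions into a corrected sentence, one option is to take, for each flagged word, the first suggestion that contains a space and substitute it into the text. The following is a rough sketch, not a definitive implementation: first_split and apply_splits are hypothetical helper names, and the suggestions list is copied by hand (abridged) from the output above so that the example runs without aspell. With a real result, something like setNames(sp$Suggestions, sp$Original) should produce the same shape, assuming the columns carry those names as described in ?aspell.

```r
# Suggestions copied (abridged) from the aspell output shown above
suggestions <- list(
  wantto     = c("want to", "want-to", "want", "wanton"),
  wordswhich = c("words which", "words-which", "wordsmith"),
  arejoined  = c("are joined", "are-joined", "rejoined")
)

# Return the first suggestion that contains a space, or NA if there is none
first_split <- function(candidates) {
  spaced <- candidates[grepl(" ", candidates, fixed = TRUE)]
  if (length(spaced) > 0) spaced[1] else NA_character_
}

# Replace every flagged word in `text` by its first space-containing suggestion
apply_splits <- function(text, suggestions) {
  for (bad in names(suggestions)) {
    fix <- first_split(suggestions[[bad]])
    if (!is.na(fix)) {
      text <- gsub(bad, fix, text, fixed = TRUE)
    }
  }
  text
}

apply_splits("I wantto separate those wordswhich arejoined.", suggestions)
```

Words for which aspell offers no suggestion containing a space are left untouched, so ordinary misspellings are not changed by this step.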

Upvotes: 3
