Reputation: 28129

R convert string to vector tokenize using " "

I have a string:

string1 <- "This is my string"

I would like to convert it to a vector that looks like this:

vector1
"This"
"is"
"my"
"string"

How do I do this? I know I could use the tm package to convert it to a TermDocumentMatrix and then to a matrix, but that would alphabetize the words, and I need them to stay in their original order.

Upvotes: 27

Answers (5)

Rich Scriven

Reputation: 99321

If you're simply extracting words by splitting on the spaces, here are a couple of nice alternatives.

string1 <- "This is my string"

scan(text = string1, what = "")
# [1] "This"   "is"     "my"     "string"

library(stringi)
stri_split_fixed(string1, " ")[[1]]
# [1] "This"   "is"     "my"     "string"
stri_extract_all_words(string1, simplify = TRUE)
#      [,1]   [,2] [,3] [,4]    
# [1,] "This" "is" "my" "string"
stri_split_boundaries(string1, simplify = TRUE)
#      [,1]    [,2]  [,3]  [,4]    
# [1,] "This " "is " "my " "string"

Upvotes: 5

Shiqing Fan

Reputation: 708

As a supplement, we can also use unlist() to produce a vector from a given list structure:

string1 <- "This is my string"
unlist(strsplit(string1, "\\s+"))  # strsplit() returns a list; unlist() flattens it into a vector
#[1] "This"   "is"     "my"     "string"
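
One thing worth keeping in mind when choosing between unlist() and [[1]]: strsplit() is vectorized and returns a list with one element per input string. A minimal sketch of the difference follows; the second input string and all variable names are made up for illustration.

```r
# Hypothetical input: two strings instead of one
strings <- c("This is my string", "and another one")

# strsplit() returns a list with one character vector per input string
pieces <- strsplit(strings, "\\s+")

first_only <- pieces[[1]]     # tokens of the first string only
all_tokens <- unlist(pieces)  # tokens of every string, flattened in input order

print(first_only)
print(all_tokens)
```

With a single input string the two approaches give the same vector, so the choice only matters once more than one string is passed in.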

Upvotes: 5

russellpierce

Reputation: 4711

Try:

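library(tm)
library(RWeka)      # provides NGramTokenizer() and Weka_control()
library(RWekajars)
string1 <- "This is my string"
NGramTokenizer(string1, Weka_control(min = 1, max = 1))  # min = max = 1: single words only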

This is an over-engineered solution for your problem, though. strsplit, as in Sacha's answer, is generally just fine.

Upvotes: 1

Sacha Epskamp

Reputation: 47541

Slightly different from Dason's answer, this splits on any amount of whitespace, including newlines:

string1 <- "This   is my
string"
strsplit(string1, "\\s+")[[1]]
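
To see why the pattern matters, here is a small comparison on input with leading and repeated spaces. The example string and variable names are invented for illustration, and the trimws() step is an addition that assumes R 3.2.0 or later.

```r
# Hypothetical messy input: two leading spaces, three spaces between words
messy <- "  This   is my string"

# Splitting on a single literal space leaves an empty string for each extra space
by_space <- strsplit(messy, " ")[[1]]

# Splitting on runs of whitespace collapses the repeats,
# but leading whitespace still produces one empty first element
by_ws <- strsplit(messy, "\\s+")[[1]]

# Trimming first removes that leading empty element
clean <- strsplit(trimws(messy), "\\s+")[[1]]

print(by_space)
print(by_ws)
print(clean)
```

If the input may start with whitespace, trimming before splitting (or dropping empty strings afterwards) keeps the result to the words alone.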

Upvotes: 15

Dason

Reputation: 61903

You can use strsplit to accomplish this task.

string1 <- "This is my string"
strsplit(string1, " ")[[1]]
#[1] "This"   "is"     "my"     "string"
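
A side note on the split argument: strsplit() treats it as a regular expression by default. For a single space that makes no difference, but for delimiters that are regex metacharacters it does. The sketch below uses a hypothetical dot-separated string to illustrate; fixed = TRUE is a documented argument of strsplit().

```r
string1 <- "This is my string"

# For a plain space, regex and literal matching give the same tokens
regex_split <- strsplit(string1, " ")[[1]]
fixed_split <- strsplit(string1, " ", fixed = TRUE)[[1]]

# Hypothetical example with a metacharacter as delimiter:
# as a regex, "." matches every character, so no words survive;
# with fixed = TRUE it is treated as a literal dot
dotted <- "This.is.my.string"
dot_fixed <- strsplit(dotted, ".", fixed = TRUE)[[1]]

print(fixed_split)
print(dot_fixed)
```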

Upvotes: 45
