Ojaswita

Reputation: 93

Using strsplit results in terms with quotation marks in R

I have a large data set that I imported from Excel, and I want to build a term frequency table for it. But when I use strsplit, the resulting terms include quotation marks and other punctuation, which gives wrong results.

I think there is a small error in the way I am using strsplit, but I have not been able to figure it out myself.

library(readxl)
df = read_excel("C:/Users/B M Consulting/Documents/Book2.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))

vect <- c(df[1])

vectsplit <- strsplit(tolower(vect), "\\s+")

vectlev <- unique(unlist(vectsplit))

vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels=vectlev)))

The contents of vect look something like this:

[1] "3 inch c clamp" "baby vice" "baby vice bench" "baby vise"
[5] "bench" "bench vice" "bench vice clamp" "bench vise"
[9] "bench voice" "bench wise" "bench wise heavy" "bench wise table"
[13] "box for tools" "c clamp" "c clamp set" "c clamps"
[17] "carpenter tools" "carpenter tools low price" "cast iron pipe" "clamp"
[21] "clamp set" "clamps woodworking" "g clamp" "g clamp set 3 inch"

I need to get each word out on its own. When I use strsplit, the result includes all the punctuation marks.

Below is a small section of the vectsplit that I get. It includes the quotation marks, backslashes and commas, which I don't want.

[1] "c(\"3" "inch" "c" "clamp\"," "\"baby" "vice\"," "\"baby" "vice"
[9] "bench\"," "\"baby" "vise\"," "\"bench\"," "\"bench" "vice\"," "\"bench" "vice"
[17] "clamp\"," "\"bench" "vise\"," "\"bench" "voice\"," "\"bench" "wise\"," "\"bench"
[25] "wise" "heavy\"," "\"bench" "wise" "table\"," "\"box" "for" "tools\","
[33] "\"c" "clamp\"," "\"c" "clamp" "set\"," "\"c" "clamps\"," "\"carpenter"
[41] "tools\"," "\"carpenter" "tools" "low" "price\"," "\"cast" "iron" "pipe\","

Upvotes: 0

Views: 256

Answers (1)

Hayden Y.

Reputation: 448

If you check the class of vect, you'll notice that it's not a character vector, but a list.

vect <- c(df[1])
class(vect)
# [1] "list"
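To see where the quotation marks come from, here is a minimal sketch. The small data frame is a hypothetical stand-in for the imported sheet, not the real data. As far as I can tell, tolower() converts non-character input with as.character(), and for a list that means each element is deparsed into a single string such as c("...", "..."), which strsplit then cuts on whitespace:

```r
# Hypothetical stand-in for the imported sheet (not the asker's real data)
df <- data.frame(term = c("baby vice", "bench vice clamp"),
                 count = c(1, 2),
                 stringsAsFactors = FALSE)

vect <- c(df[1])   # a list holding one character vector
class(vect)        # "list"

# tolower() coerces the list with as.character(), which deparses the
# whole column into one string that looks like R source code
flat <- tolower(vect)
length(flat)       # 1: the entire column collapsed into a single string
flat

# Splitting that single string on whitespace keeps the c(, quotes and commas
pieces <- strsplit(flat, "\\s+")[[1]]
pieces
```

Printing pieces should show the same kind of debris as in the question, with the leading c( and the quotes still attached to the words.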

If you define vect as below, the issue disappears:

vect <- df[[1]]
class(vect)
# [1] "character"

With vect defined this way, strsplit should work just fine. Keep in mind that [1] and [[1]] return different things: df[1] gives you a one-column data frame (which c() then turns into a list), whereas df[[1]] extracts the column itself as a character vector.
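For completeness, here is a rough sketch of the whole pipeline with that one change applied. The data frame is again a hypothetical stand-in for the Excel import, and termfreq is a name I made up for the overall counts. The rest follows the code in the question:

```r
# Hypothetical stand-in for the imported sheet (not the asker's real data)
df <- data.frame(term = c("baby vice", "bench vice clamp", "c clamp"),
                 count = c(1, 2, 3),
                 stringsAsFactors = FALSE)

vect <- df[[1]]                               # character vector, not a list
vectsplit <- strsplit(tolower(vect), "\\s+")  # one vector of words per entry
vectlev <- unique(unlist(vectsplit))          # vocabulary across all entries

# Term counts per entry: one row per term, one column per original entry
vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels = vectlev)))

# Overall frequency of each term across all entries (hypothetical extra step)
termfreq <- table(factor(unlist(vectsplit), levels = vectlev))
```

If you only need the overall counts, termfreq alone is enough. The per-entry matrix vecttermf is useful when you want to know which entries contain which terms.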

Upvotes: 1
