Hi2736464

Reputation: 47

Grepl - selecting and counting word in output

I have a very large data set of reviews from TripAdvisor, and I am using the grepl function to count the occurrences of a word. I also wish to calculate a conditional probability.

I wish to count the number of times the word 'but' appears in the data frame and, in addition, how many of the reviews that contain 'but' are also helpful. A helpful review is any review with a number of votes > 0. The code I have so far states which reviews are helpful and marks which reviews contain 'but', but I want it to output a specific number.


Desired output: 'but' appears 1800 times; 900 of those 1800 times, the review is helpful.

Example of two reviews containing 'but':

"what a fantastic experience! this hotel has everything, amazing staff, gorgeous pool, super cool atmosphere, beautiful rooms and just a hop and a skip away from the busy kamari strip. it's about 5 minute walk from the main strip but you'll be glad not to be in the thick of things, this place is truly a sanctuary and the best!!"

"we stayed here for a week at the beginning of may. this was the start of the season, but the staff were full of enthusiasm and very friendly and helpful. despite the fact our thomson rep came to the hotel every morning we still went to the hotel staff for advice and recommendations for the island. they booked reservations for restaurants we visited and also arranged car hire for us. the hotel itself is very clean and tidy, you are welcomed into a courtyard area with palm trees and rustic seating areas. the reception is nice and bright and clean with a massive bookshelf full of books that you can borrow during your stay. breakfast area is nicely organised and the food is very good."

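dfRev <- read.csv("reviews_final.csv", row.names = 1, stringsAsFactors = FALSE)
# Lower-case the review text so that "But" and "but" are treated alike
dfRev$review_body <- tolower(dfRev$review_body)
View(dfRev)
# Label each review by whether it received any helpful votes
ifelse(dfRev$helpful_votes > 0, "Review helpful", "Review not helpful")

# Note: tolower() must be applied to the text column only; calling it on the
# whole data frame would collapse it into a character vector

# TRUE for each review whose text contains "but"
dfRev$but <- grepl("but", dfRev$review_body)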

Upvotes: 1

Views: 63

Answers (2)

AndrewGB

Reputation: 16876

Here is another possible solution with tidyverse. str_detect returns a logical for each review indicating whether it contains but, and + converts that logical to 0 or 1. Summing the result gives the number of reviews that contain but. In the same way, we can sum the reviews that are both helpful and contain the word but.

library(tidyverse)

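results <- df %>%
  summarise(
    # Number of reviews whose text contains "but"
    n_but = sum(+str_detect(tolower(review_body), "but"), na.rm = TRUE),
    # Test each row for "but" again here: inside summarise(), n_but is already
    # a single total, so n_but > 0 would be TRUE for every row
    n_but_helpful = sum(+(helpful_votes > 0 & str_detect(tolower(review_body), "but")), na.rm = TRUE)
  )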

results
#   n_but n_but_helpful
# 1     3             2
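
One caveat worth illustrating: the pattern "but" also matches inside longer words such as "butter", and str_detect counts reviews rather than occurrences. The following is a minimal sketch, not part of the original answer, that uses a word-boundary pattern together with str_count. The `reviews` data frame and its contents are hypothetical toy data made up for illustration.

```r
library(stringr)

# Hypothetical toy data, not taken from the question's data set
reviews <- data.frame(
  review_body = c("Nice pool, but noisy. But we would return.",
                  "The butter at breakfast was great.",
                  "Small room but spotless."),
  helpful_votes = c(3, 1, 0),
  stringsAsFactors = FALSE
)

# \\b marks a word boundary, so "butter" is not matched;
# ignore_case = TRUE makes tolower() unnecessary
pattern <- regex("\\bbut\\b", ignore_case = TRUE)

has_but <- str_detect(reviews$review_body, pattern)       # one logical per review
n_occurrences <- str_count(reviews$review_body, pattern)  # matches per review

sum(has_but)                              # reviews containing the word
sum(n_occurrences)                        # total occurrences across reviews
sum(has_but & reviews$helpful_votes > 0)  # helpful reviews containing the word
```

Whether to count reviews or occurrences depends on what "appears 1800 times" is meant to describe; the two numbers differ whenever a review uses the word more than once.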

Output

We can use a simple paste to get a statement with the data.

paste("'but' appears", results$n_but, "times, and", results$n_but_helpful, "of those", results$n_but, "times, the review is helpful")

# "'but' appears 3 times, and 2 of those 3 times, the review is helpful"

For RMD

Another option, if you are working in R Markdown, is to write the statement outside of a code chunk and use inline code to fill in the values, such as:

'but' appears `r results$n_but` times, and `r results$n_but_helpful` of those `r results$n_but` times, the review is helpful

Then, when you knit the document, you will see the values filled in.

Data

df <- structure(list(review_body = c("the accomodation was not great, but I had a great time", 
"what but a fantastic experience! this hotel has everything, amazing staff, gorgeous pool, super cool atmosphere, beautiful rooms and just a hop and a skip away from the busy kamari strip. it's about 5 minute walk from the main strip but you'll be glad not to be in the thick of things, this place is truly a sanctuary and the best!!", 
"we stayed here for a week at the beginning of may. this was the start of the season, but the staff were full of enthusiasm and very friendly and helpful. despite the fact our thomson rep came to the hotel every morning we still went to the hotel staff for advice and recommendations for the island. they booked reservations for restaurants we visited and also arranged car hire for us. the hotel itself is very clean and tidy, you are welcomed into a courtyard area with palm trees and rustic seating areas. the reception is nice and bright and clean with a massive bookshelf full of books that you can borrow during your stay. breakfast area is nicely organised and the food is very good."
), helpful_votes = c(0, 2, 4)), class = "data.frame", row.names = c(NA, 
-3L))

Upvotes: 1

Skaqqs

Reputation: 4140

You can count the number of TRUE values in a logical vector using sum(). For example:

sum(TRUE)
[1] 1

sum(c(TRUE, TRUE))
[1] 2

sum(c(TRUE, TRUE, FALSE, FALSE, TRUE))
[1] 3

You can compare two vectors of logicals and return a count of pairs that are both TRUE:

a <- c(TRUE, FALSE, TRUE)
b <- c(FALSE, TRUE, TRUE)
sum(a & b)
[1] 1
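
The same two-vector idea can be carried one step further to the conditional probability mentioned in the question: divide the count of pairs that are both TRUE by the count of TRUE values in the conditioning vector. The sketch below assumes a small hypothetical data frame named `toy` (made up for illustration, not the asker's data) with the same two columns.

```r
# Hypothetical toy data standing in for the review data frame
toy <- data.frame(
  review_body = c("good but pricey", "lovely stay", "clean but small", "far but quiet"),
  helpful_votes = c(2, 5, 0, 1),
  stringsAsFactors = FALSE
)

has_but <- grepl("but", toy$review_body)  # review contains "but"
is_helpful <- toy$helpful_votes > 0       # review has at least one vote

n_but <- sum(has_but)                       # reviews containing "but"
n_but_helpful <- sum(has_but & is_helpful)  # ... that are also helpful

# P(helpful | contains "but"): share of "but" reviews that are helpful
p_helpful_given_but <- n_but_helpful / n_but

# Equivalent: the mean of a logical vector is the proportion of TRUE values
p_alt <- mean(is_helpful[has_but])

sprintf("'but' appears in %d reviews; %d of those are helpful (%.1f%%)",
        n_but, n_but_helpful, 100 * p_helpful_given_but)
```

If no review contains the word, `n_but` is 0 and the ratio is NaN, so a real script may want to check for that case first.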

With your criteria:

data.frame(
  "n_but" = sum(grepl("but", dfRev$review_body)),
  "n_helpful" = sum(dfRev$helpful_votes > 0),
  "n_but_helpful" = sum(grepl("but", dfRev$review_body) & dfRev$helpful_votes > 0))

Upvotes: 1
