ASSOND

Reputation: 37

Using grep function for text mining

I have a problem scoring my data. Below is the data set. The text column holds the tweets on which I want to do text mining and sentiment analysis.

**text**                                             **call**  **bills**  **location**
-the bill was not generated                          0         bill       0
-tried to raise the complaint                        0         0          0
-the location update failed                          0         0          location
-the call drop has increased in my location          call      0          location
-nobody in the location received bill,so call ASAP   call      bill       location

This is dummy data, where text is the column I am trying to mine. I have used the grepl function in R to create the columns (e.g. bills, call, location): if "bill" appears in a row, the bills column holds bill for that row, and likewise for all the other categories.

vdftweet$app = ifelse(grepl('app',tolower(vdftweet$text)),'app',0)
table(vdftweet$app)

What I cannot work out is the next step.

I want to create a new column, "category_name", in which each row gives the names of the categories the tweet falls into. If a tweet falls into more than 3 categories, it should be marked 'other'; otherwise the column should list the category names.

Upvotes: 3

Views: 328

Answers (2)

WaltS

Reputation: 5530

There are a couple of ways you could do this with the tidyverse packages. In the first method, mutate adds the category names as columns to the text data frame, similar to what you already have. gather then transforms that into key-value format, in which the categories are values in the category_name column.

The alternative approach goes directly to the key-value format, in which the categories are values in the category_name column. Rows are repeated if they fall into multiple categories. If you don't need the first form with the categories as column names, the alternative approach is more flexible for adding new categories and requires less processing.

In both methods, the pattern argument of str_match holds the regular expression that matches a category in the text. The patterns here are trivial, but a more complex pattern could be used if needed.
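As a rough illustration of what a more complex pattern might look like (separate from the main code), here is a small sketch. The sample tweets and the bill_pattern name are made up for the example; it assumes you want "bill", "bills" or "billing" as whole words, regardless of case.

```r
library(stringr)

# Made-up sample tweets, not taken from the question's data
tweets <- c("The BILL was not generated",
            "billing error on my account",
            "billboard near my location")

# Hypothetical pattern: "bill", "bills" or "billing" as a whole word, any case
bill_pattern <- regex("\\bbill(s|ing)?\\b", ignore_case = TRUE)

# str_match() returns a matrix; column 1 is the full match (NA when absent)
bill_hit <- str_match(tweets, bill_pattern)[, 1]
bill_hit
```

The word boundaries keep "billboard" from being counted as a bill, which a plain "bill" pattern would not.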

The code follows:

library(tidyverse)
#
# read dummy data into data frame
#
   dummy_dat <- read.table(header = TRUE,stringsAsFactors = FALSE, 
                      strip.white=TRUE, sep="\n",
          text= "text
            -the bill was not generated
          -tried to raise the complaint
          -the location update failed
          -the call drop has increased in my location
          -nobody in the location received bill,so call ASAP")
#
#  form data frame with categories as columns
#
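   dummy_cats <- dummy_dat %>%
      mutate(text = tolower(text),
             # use the lower-cased text column (.$text would be the original text);
             # str_match returns a matrix, column 1 holds the full match (NA if none)
             bill = str_match(text, pattern = "bill")[, 1],
             call = str_match(text, pattern = "call")[, 1],
             location = str_match(text, pattern = "location")[, 1],
             other = ifelse(is.na(bill) & is.na(call) & is.na(location),
                            "other", NA))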
#
#  convert categories as columns to key-value format
#  with categories as values in category_name column
#

   dummy_cat_name <- dummy_cats %>% 
               gather(key = type, value=category_name, -text,na.rm = TRUE) %>%
               select(-type) 

#
#---------------------------------------------------------------------------
#
#  ALTERNATIVE:  go directly from text data to key-value format with categories
#  as values under category_name
#  Rows are repeated if they fall into multiple categories
#  Rows with no categories are put in category other
#
   dummy_dat <- dummy_dat %>% mutate(text=tolower(text))
   dummy_cat_name1 <- data.frame(text = NULL, category_name =NULL)
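   for (cat in c("bill", "call", "location")) {
      # keep only the rows that match this category;
      # column 1 of the str_match result is the full match (NA if none)
      temp <- dummy_dat %>%
         mutate(category_name = str_match(text, pattern = cat)[, 1]) %>%
         na.omit()
      dummy_cat_name1 <- dummy_cat_name1 %>% bind_rows(temp)
   }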
    dummy_cat_name1 <- left_join(dummy_dat, dummy_cat_name1, by = "text") %>%
               mutate(category_name = ifelse(is.na(category_name), "other", category_name))

The result is

 dummy_cat_name1
                                            text      category_name
                            -the bill was not generated          bill
                          -tried to raise the complaint         other
                            -the location update failed      location
            -the call drop has increased in my location          call
            -the call drop has increased in my location      location
     -nobody in the location received bill,so call asap          bill
     -nobody in the location received bill,so call asap          call
     -nobody in the location received bill,so call asap      location
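
If you want a single category_name per tweet, as the question asks, the key-value result can be collapsed. The following is a minimal sketch, not part of the code above: kv is a stand-in for dummy_cat_name1, the last tweet and its "app" category are made up so that the limit is actually hit, and max_cats is an assumed threshold taken from the question's "more than 3".

```r
library(dplyr)

# Stand-in for dummy_cat_name1, plus one made-up tweet matching four categories
kv <- data.frame(
  text = c("-the bill was not generated",
           "-tried to raise the complaint",
           "-the call drop has increased in my location",
           "-the call drop has increased in my location",
           rep("-app shows no bill, call or location", 4)),
  category_name = c("bill", "other", "call", "location",
                    "app", "bill", "call", "location"),
  stringsAsFactors = FALSE
)

max_cats <- 3  # assumed threshold: more than this many categories becomes "other"

# One row per tweet: join the category names, or use "other" above the limit
per_tweet <- kv %>%
  group_by(text) %>%
  summarise(category_name = if (n() > max_cats) {
    "other"
  } else {
    paste(category_name, collapse = ", ")
  })
per_tweet
```

With only bill, call and location as categories the limit can never be exceeded, so the "other" branch only matters once more categories are added.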

Upvotes: 1

gfgm

Reputation: 3647

Here is an approach that uses apply to check, row by row, whether the column names intersect with the entries in that row:

df1 <- data.frame(text = c("blah blah bill", "blah call", "the location failed to update", 
                           "bill, call, location, blah", "bill blah location", "bill, call, location, app"),
                  bill = c('bill', 0, 0, 'bill', "bill", 'bill'),
                  call = c(0, 'call', 0, 'call', 0, 'call'),
                  location = c(0,0,'location','location', "location", 'location'),
                  app = c(0,0,0,0,0, 'app'))
df1
#>                            text bill call location app
#> 1                blah blah bill bill    0        0   0
#> 2                     blah call    0 call        0   0
#> 3 the location failed to update    0    0 location   0
#> 4    bill, call, location, blah bill call location   0
#> 5            bill blah location bill    0 location   0
#> 6     bill, call, location, app bill call location app

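df1$category_name <- apply(df1[-1], 1, function(row) {
  # column names that also appear as values in this row
  hits <- names(row)[names(row) %in% row]
  # fewer than 4 hits: list them; otherwise label the row "Other"
  if (length(hits) < 4) paste(hits, collapse = ", ") else "Other"
})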
df1
#>                            text bill call location app
#> 1                blah blah bill bill    0        0   0
#> 2                     blah call    0 call        0   0
#> 3 the location failed to update    0    0 location   0
#> 4    bill, call, location, blah bill call location   0
#> 5            bill blah location bill    0 location   0
#> 6     bill, call, location, app bill call location app
#>          category_name
#> 1                 bill
#> 2                 call
#> 3             location
#> 4 bill, call, location
#> 5       bill, location
#> 6                Other

If the names of the columns do not correspond to the terms you are searching for, but you have those terms stored in some vector, e.g. keys, the same approach will work: just insert keys wherever names(row) appears in the code.
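
A minimal sketch of that variant, assuming made-up column names c1 to c3 and a hypothetical keys vector (neither is from the question):

```r
# Hypothetical data: the column names (c1..c3) differ from the search terms
df2 <- data.frame(text = c("blah blah bill", "bill, call, location, blah"),
                  c1 = c("bill", "bill"),
                  c2 = c("0", "call"),
                  c3 = c("0", "location"),
                  stringsAsFactors = FALSE)

keys <- c("bill", "call", "location")  # assumed vector of search terms

df2$category_name <- apply(df2[-1], 1, function(row) {
  # search terms that appear as values in this row
  hits <- keys[keys %in% row]
  if (length(hits) < 4) paste(hits, collapse = ", ") else "Other"
})
df2$category_name
```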

Created on 2018-05-10 by the reprex package (v0.2.0).

Upvotes: 1
