Marcin
Marcin

Reputation: 8044

Weird behavior of mutate function from dplyr package in R

I am working on a set with dimensions

dim(data)
[1] 419612      2

Where second column look more-or-lesslike this:

> unique(data[1:50,"topics"])
[1] {"dom":2.0,"moda":3.0,"rodzina":1.55,"praca":1.42,"finanse":1.96,"edukacja":1.67,"sport":1.96,"muzyka":1.52,"kuchnia":1.8,"plotka":1.8,"zdrowie":1.12,"kibic":1.8,"uroda":2.32,"gra":2.94,"motoryzacja":1.33,"kultura":1.42,"film":3.14,"podróż":1.9,"technologia":1.31}
[2] {"rodzina":2.99,"kultura":4.46,"muzyka":4.5}                                                                                                                                                                                                                            
[3] {"dom":1.93,"rodzina":5.37,"zwierzęta":3.0,"praca":4.3,"finanse":2.11,"sport":2.1,"muzyka":2.99,"nieruchomość":2.8,"kuchnia":6.4,"plotka":2.1,"zdrowie":3.79,"gra":4.25,"motoryzacja":2.57,"kultura":3.13,"film":4.4,"podróż":3.21}                                     
[4] {"plotka":9.5,"uroda":10.06,"kultura":15.67,"muzyka":29.97}                                                                                                                                                                                                             
[5] {"dom":2.99,"rodzina":2.5,"edukacja":3.85,"sport":1.17,"muzyka":1.23,"nieruchomość":2.95,"kuchnia":1.42,"wnętrze":1.33,"kibic":1.17,"ogród":1.33,"motoryzacja":1.17,"film":1.17,"podróż":1.57}                                                                          
[6] {"kuchnia":4.38,"plotka":1.33,"rodzina":1.61,"film":1.33}                                                                                                                                                                                                               
37530 Levels: {"biznes":1.0} ... {"zwierzęta":9.96,"podróż":9.97}

For each row I'd like to choose te word from topics column that have the highest grade after : sign. I tried to use mutate function from dplyr package it looks like it did not work. Opeartions on characters where made with stringi package that are a faster version of stringr. My code and resultof this operation is below. Anyone knows why I get the same value in every row after this operation, and how to achieve the desired result without using for loop?

> data2 <- data %>%
+   mutate( xx = topics %>%
+             stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>% 
+             unlist %>% 
+             data.frame( topic = .[seq(1,length(.), by=2)], 
+                         waga = .[seq(2,length(.), by=2)] )  %>% 
+             select( topic, waga) %>% arrange( desc( waga)) %>%
+             unique() %>%
+             .[1,1]
+             )
> table(data2$xx)[ which(table(data2$xx) > 1) ]
kuchnia 
 419612 

I've added extra column nr that is a row number, and then I've stupidly group_byed on that column and summarised instead of mutate and achived what I desired... but I'm not proud of my code. Any other ideas?

daneBC1 <- data %>% 
  group_by( nr)  %>%
  summarise( bc1 = topics %>%
               stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>% 
               unlist %>% 
               data.frame( topic = .[seq(1,length(.), by=2)], 
                           waga = .[seq(2,length(.), by=2)] )  %>% 
               select( topic, waga) %>% arrange( desc( waga)) %>%
               unique() %>%
               .[1,1] )



daneBC1$bc1 %>% table

        dom    edukacja        film     finanse         gra       kibic     kuchnia     kultura 
     119802       79487       55569       38134       30425       21757       16371       12356 
       moda motoryzacja      muzyka      plotka      podróż       praca     rodzina       sport 
      11103        7264        6357        4855        3520        3005        2317        2183 
technologia       uroda     zdrowie 
       1441        1055         740 

Sample data

library(archivist)
data <- loadFromGithubRepo( "97f74c5a10f510cce39eafb0d9a1a9e8", 
user="MarcinKosinski", repo="Museum", value = TRUE )

Upvotes: 1

Views: 389

Answers (1)

MrFlick
MrFlick

Reputation: 206197

Your mutate() function is not "vectorized". Mutate doesn't operate on a row at a time, it operates on entire columns as vectors. Your unlist and and .[1,1] extraction are taking the values for all rows and collapsing down to one vector and one value.

You can make a vectorized tranformation function with

extr <- Vectorize(. %>%
         stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>% 
         unlist %>% 
         data.frame( topic = .[seq(1,length(.), by=2)], 
                     waga = .[seq(2,length(.), by=2)] )  %>% 
         select( topic, waga) %>% arrange( desc( waga)) %>%
         unique() %>%
         .[1,1])

and then use it with

data %>% mutate( xx = extr(topics))

although I agree with others that since you have JSON data, it would be better to properly parse this data with a JSON parser rather than trying to re-invent the wheel with regular expressions.

Upvotes: 2

Related Questions