Reputation: 8044
I am working on a data set with dimensions
dim(data)
[1] 419612 2
where the second column looks more or less like this:
> unique(data[1:50,"topics"])
[1] {"dom":2.0,"moda":3.0,"rodzina":1.55,"praca":1.42,"finanse":1.96,"edukacja":1.67,"sport":1.96,"muzyka":1.52,"kuchnia":1.8,"plotka":1.8,"zdrowie":1.12,"kibic":1.8,"uroda":2.32,"gra":2.94,"motoryzacja":1.33,"kultura":1.42,"film":3.14,"podróż":1.9,"technologia":1.31}
[2] {"rodzina":2.99,"kultura":4.46,"muzyka":4.5}
[3] {"dom":1.93,"rodzina":5.37,"zwierzęta":3.0,"praca":4.3,"finanse":2.11,"sport":2.1,"muzyka":2.99,"nieruchomość":2.8,"kuchnia":6.4,"plotka":2.1,"zdrowie":3.79,"gra":4.25,"motoryzacja":2.57,"kultura":3.13,"film":4.4,"podróż":3.21}
[4] {"plotka":9.5,"uroda":10.06,"kultura":15.67,"muzyka":29.97}
[5] {"dom":2.99,"rodzina":2.5,"edukacja":3.85,"sport":1.17,"muzyka":1.23,"nieruchomość":2.95,"kuchnia":1.42,"wnętrze":1.33,"kibic":1.17,"ogród":1.33,"motoryzacja":1.17,"film":1.17,"podróż":1.57}
[6] {"kuchnia":4.38,"plotka":1.33,"rodzina":1.61,"film":1.33}
37530 Levels: {"biznes":1.0} ... {"zwierzęta":9.96,"podróż":9.97}
For each row I'd like to choose the word from the topics column that has the highest grade after the : sign. I tried to use the mutate function from the dplyr package, but it looks like it did not work. Operations on characters were made with the stringi package, which is a faster version of stringr. My code and the result of this operation are below. Does anyone know why I get the same value in every row after this operation, and how to achieve the desired result without using a for loop?
> data2 <- data %>%
+ mutate( xx = topics %>%
+ stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>%
+ unlist %>%
+ data.frame( topic = .[seq(1,length(.), by=2)],
+ waga = .[seq(2,length(.), by=2)] ) %>%
+ select( topic, waga) %>% arrange( desc( waga)) %>%
+ unique() %>%
+ .[1,1]
+ )
> table(data2$xx)[ which(table(data2$xx) > 1) ]
kuchnia
419612
I've added an extra column nr that is a row number, and then I've stupidly used group_by on that column and summarise instead of mutate, and achieved what I desired... but I'm not proud of my code. Any other ideas?
daneBC1 <- data %>%
group_by( nr) %>%
summarise( bc1 = topics %>%
stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>%
unlist %>%
data.frame( topic = .[seq(1,length(.), by=2)],
waga = .[seq(2,length(.), by=2)] ) %>%
select( topic, waga) %>% arrange( desc( waga)) %>%
unique() %>%
.[1,1] )
daneBC1$bc1 %>% table
dom edukacja film finanse gra kibic kuchnia kultura
119802 79487 55569 38134 30425 21757 16371 12356
moda motoryzacja muzyka plotka podróż praca rodzina sport
11103 7264 6357 4855 3520 3005 2317 2183
technologia uroda zdrowie
1441 1055 740
Sample data
library(archivist)
data <- loadFromGithubRepo( "97f74c5a10f510cce39eafb0d9a1a9e8",
user="MarcinKosinski", repo="Museum", value = TRUE )
Upvotes: 1
Views: 389
Reputation: 206197
The expression inside your mutate() call is not "vectorized". mutate() doesn't operate on one row at a time; it operates on entire columns as vectors. Your unlist and .[1,1] extraction take the values for all rows and collapse them down to one vector and then one value, which is recycled into every row.
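To make the collapse concrete, here is a minimal base R sketch. The helper first_piece and the toy topics vector are hypothetical stand-ins for the real pipeline and column, not part of the original code; the helper only mimics the "unlist, then take the first element" shape.

```r
# Hypothetical stand-in for the pipeline: split every string, flatten the
# pieces with unlist(), and keep only the first element of the flat vector.
first_piece <- function(x) unlist(strsplit(x, ","))[1]

# Toy column with three "rows" (assumed data, not the real topics column)
topics <- c("a,b", "c,d", "e,f")

# Called once on the whole column, which is how mutate() evaluates it:
# all rows are flattened together and a single value comes back.
whole_column <- first_piece(topics)

# Called once per element: one value per row.
per_row <- vapply(topics, first_piece, character(1), USE.NAMES = FALSE)

print(whole_column)
print(per_row)
```

A length-one result like whole_column is what ends up repeated in every row of the new column.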
You can make a vectorized transformation function with
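extr <- Vectorize(. %>%
stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>%
unlist %>%
# odd positions are topic names, even positions are their weights;
# as.numeric() makes the sort numeric, so that 10.06 ranks above 9.5
data.frame( topic = .[seq(1,length(.), by=2)],
waga = as.numeric(.[seq(2,length(.), by=2)]) ) %>%
select( topic, waga) %>% arrange( desc( waga)) %>%
unique() %>%
.[1,1])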
and then use it with
data %>% mutate( xx = extr(topics))
although I agree with others that since you have JSON data, it would be better to properly parse this data with a JSON parser rather than trying to re-invent the wheel with regular expressions.
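A minimal sketch of the JSON-parser route, assuming the jsonlite package is installed. The helper name top_topic is hypothetical, and the three sample strings are copied from the output shown in the question. Parsing each string as JSON returns the weights as numbers, so they are compared numerically rather than as text.

```r
# Assumption: the jsonlite package is available.
library(jsonlite)

# Hypothetical helper: for each JSON string, return the key with the largest value.
top_topic <- function(topics) {
  vapply(as.character(topics), function(js) {
    # fromJSON() turns {"a":1,"b":2} into a named list; unlist() makes it
    # a named numeric vector
    weights <- unlist(fromJSON(js))
    # an empty object has no topic to report
    if (length(weights) == 0) return(NA_character_)
    names(weights)[which.max(weights)]
  }, character(1), USE.NAMES = FALSE)
}

# Three rows taken from the question; topics is a factor there, hence factor() here
sample_topics <- factor(c(
  '{"rodzina":2.99,"kultura":4.46,"muzyka":4.5}',
  '{"plotka":9.5,"uroda":10.06,"kultura":15.67,"muzyka":29.97}',
  '{"kuchnia":4.38,"plotka":1.33,"rodzina":1.61,"film":1.33}'
))

result <- top_topic(sample_topics)
print(result)
```

Because top_topic already returns one value per element, it should be usable directly as data %>% mutate( xx = top_topic(topics)). It parses every row separately and is not tuned for speed.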
Upvotes: 2