Stataq

Reputation: 2297

How to build a new variable from a column containing many words

I have data with a single column, Hobby, that holds comma-separated words (sample data is given at the end of the question).

I would like to build a new variable that shows only the music-related hobbies. I tried to use gsub for this, but it did not work. Any suggestions on how to do this are welcome, not limited to gsub.

My code is: df$music<-gsub("Sawing"|"Cooking", "", df$Hobby)

The outcome should be a new column that keeps only the music-related hobbies, for example "piano, violin" for the first row of the sample data.

Sample data can be built with:

df<- structure(list(Hobby = c("cooking, sawing, piano, violin", "cooking, violin", 
"piano, sawing", "sawing, cooking")), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 1

Views: 38

Answers (2)

Ronak Shah

Reputation: 389012

Another way to do this is to split each row into one hobby per row, drop the unwanted words, and paste the rest back together:

library(dplyr)
library(tidyr)

df %>%
  mutate(index = row_number()) %>%
  separate_rows(Hobby, sep = ',\\s*') %>%
  group_by(index) %>%
  summarise(Music = toString(setdiff(Hobby, c('sawing', 'cooking'))), 
            Hobby = toString(Hobby)) %>%
  select(Hobby,Music)

#  Hobby                          Music          
#  <chr>                          <chr>          
#1 cooking, sawing, piano, violin "piano, violin"
#2 cooking, violin                "violin"       
#3 piano, sawing                  "piano"        
#4 sawing, cooking                ""             

Upvotes: 1

akrun

Reputation: 887213

The pattern should be a single quoted string, "Sawing|Cooking", with the | inside the quotes, not "Sawing"|"Cooking". Since the data is lower case, ignore.case = TRUE is also needed for the capitalised pattern to match.

df$music <- trimws(gsub("Sawing|Cooking", "", df$Hobby, ignore.case = TRUE),
                   whitespace = "(,\\s*){1,}")

trimws removes any leading or trailing commas and spaces left behind by gsub.


The opposite approach is to extract the words of interest and paste them back together:

library(stringr)
sapply(str_extract_all(df$Hobby, 'piano|violin'), toString)
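
Note that the gsub/trimws approach only cleans up the ends of the string: if an unwanted word sits between two wanted ones (e.g. "piano, sawing, violin"), a stray ", ," is left in the middle. A minimal base R sketch that avoids this by splitting on commas and filtering is shown below. The names drop_words and keep_music are hypothetical and not part of the original post, and the sample data is rebuilt as a plain data.frame for brevity.

```r
# Sample data from the question, as a plain data.frame
df <- data.frame(
  Hobby = c("cooking, sawing, piano, violin", "cooking, violin",
            "piano, sawing", "sawing, cooking"),
  stringsAsFactors = FALSE
)

# Hypothetical list of words to remove (lower case)
drop_words <- c("sawing", "cooking")

# Hypothetical helper: split each string on commas, drop unwanted words,
# and paste the remaining words back into one comma-separated string
keep_music <- function(x, drop = drop_words) {
  parts <- strsplit(x, ",\\s*")  # one character vector per row
  vapply(parts, function(p) {
    toString(p[!tolower(p) %in% drop])
  }, character(1))
}

df$music <- keep_music(df$Hobby)
df$music
# [1] "piano, violin" "violin"        "piano"         ""
```

Because the filtering happens per word rather than on the raw string, the position of the unwanted word no longer matters: keep_music("piano, sawing, violin") gives "piano, violin".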

Upvotes: 3
