Reputation: 844

Detecting rows with multiple observations in R

I have a dataset like this. I want to recognize all of the observations that have multiple values in the "color" columns and replace them with "multicolor"

ID  color1   color2
23   red      NA
44   blue     purple
51   yellow   NA
59   green    orange

like this:

ID  color   
23   red      
44   multicolor     
51   yellow     
59   multicolor

Any ideas would be much appreciated, thank you!

Upvotes: 3

Answers (4)

Chris Ruehlemann

Reputation: 21440

Here's what seems like a straightforward solution:

library(dplyr)
library(stringr)
data %>%
  mutate(
    # step 1 - paste `color1` and `color2` together and remove " NA":
    color = gsub("\\sNA", "", paste(color1, color2)),
    # step 2 - count the number of white space characters:
    color = str_count(color, " "),
    # step 3 - label `color` as "multicolor" where `color` != 0:
    color = ifelse(color == 0, color1, "multicolor")) %>%
  # remove the obsolete color columns: 
  select(-matches("\\d$"))
  ID      color
1 23        red
2 44 multicolor
3 51     yellow
4 59 multicolor

Data:

data <- data.frame(ID = c(23, 44, 51, 59),
                   color1 = c("red", "blue", "yellow", "green"),
                   color2 = c(NA, "purple", NA, "orange"))

Upvotes: 1

Andre Wildberg

Reputation: 19271

A base R approach

# get colors from columns named color*
colo <- paste(names(table(unlist(df1[,grep("color",colnames(df1))]))), collapse="|")

colo
[1] "blue|green|red|yellow|orange|purple"

# match the colors and do the conversion
data.frame( 
  ID=df1$ID, 
  color=apply( df1, 1, function(x){ 
    y=x[grep(colo, x)];
    if(length(y)>1){y="multicolor"}; y } ) )
  ID      color
1 23        red
2 44 multicolor
3 51     yellow
4 59 multicolor

Data

df1 <- structure(list(ID = c(23L, 44L, 51L, 59L), color1 = c("red", 
"blue", "yellow", "green"), color2 = c(NA, "purple", NA, "orange"
)), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 0

busemorose

Reputation: 317

You can do this, assuming data is your dataset.

library(dplyr)

data <- data.frame(ID = c(23, 44, 51, 59),
                   color1 = c("red", "blue", "yellow", "green"),
                   color2 = c(NA, "purple", NA, "orange"))

data %>% 
  mutate(color = ifelse(is.na(color2), color1, "multicolor")) %>% 
  select(ID, color)

Upvotes: 2

user10917479

Reputation:

Here is a way to do it in the tidyverse.

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = starts_with("color"), values_to = "color", values_drop_na  = TRUE) %>% 
  group_by(ID) %>% 
  summarize(n = n(),
            color = toString(color), .groups = "drop") %>% 
  mutate(color = if_else(n > 1, "multicolor", color)) %>% 
  select(-n)

# # A tibble: 4 x 2
#      ID color     
#   <int> <chr>     
# 1    23 red       
# 2    44 multicolor
# 3    51 yellow    
# 4    59 multicolor

I did it this way on purpose. Note that if you stopped after the summarize() line, you would get the actual colors.

# # A tibble: 4 x 3
#      ID     n color        
#   <int> <int> <chr>        
# 1    23     1 red          
# 2    44     2 blue, purple 
# 3    51     1 yellow       
# 4    59     2 green, orange

This would scale if you had many color columns, not just 2. Play around with it, there are a lot of ways to tweaks things like this.

Data

df <- read.table(textConnection("ID  color1   color2
23   red      NA
44   blue     purple
51   yellow   NA
59   green    orange"), header = TRUE)

Upvotes: 5

Detecting rows with multiple observations in R

Answers (4)

Data

Data

Related Questions