Science11
Science11

Reputation: 883

r string data cleaning

I have dataset with some weird mix of strings like this

   ID   State
   1    NA
   2    IL
   3    IL,IL,IL
   4    OH,IL
   5    NM,NM,AL,AL
   6    FL,FL,FL

I like to

Expected dataset

   ID   State
   1    NA
   2    IL
   3    IL
   4    NA
   5    NA
   6    FL

I tried paste(unique(df$State), collapse=",") but that did not work. Any suggestions regarding this is much appreciated. Thanks.

Upvotes: 0

Views: 79

Answers (2)

hello_friend
hello_friend

Reputation: 5788

Base R solution:

with(
  df,
  vapply(
    State,
    function(x){
      y <- toString(
        unique(
          unlist(
            strsplit(
              x,
              ","
              )
            )
          )
        )
      ifelse(
        grepl(
          ",|NA",
          y),
        NA_character_,
        y
      )
    },
    character(1),
    USE.NAMES = FALSE
  )
)

Tidyverse solution:

library(tidyverse)
str_split(df$State, ",") %>% 
  map(function(x) str_c(unique(x), collapse = ", ")) %>% 
  map_chr(function(y) if_else(str_detect(y, ","), NA_character_, y))

Data:

df <- structure(list(ID = 1:6, State = c(NA, "IL", "IL,IL,IL", "OH,IL", 
"NM,NM,AL,AL", "FL,FL,FL")), class = "data.frame", row.names = c(NA, 
-6L))

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

Split the State values on comma (,) and replace it with NA values if there are two different values in them. Use distinct to keep only the unique rows for each ID.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(State, sep = ',\\s*') %>%
  group_by(ID) %>%
  mutate(State = replace(State, n_distinct(State) > 1, NA)) %>%
  distinct() %>%
  ungroup()

#     ID State
#  <int> <chr>
#1     1 NA   
#2     2 IL   
#3     3 IL   
#4     4 NA   
#5     5 NA   
#6     6 FL   

Upvotes: 3

Related Questions