hheejuice
hheejuice

Reputation: 49

R name cleanup to get correct unique variables

Hi I have a dataframe with names and some numbers. The number in column 't1' should be unique for each person but there are some cases they are not. So I want to find those rows where one number in column 't1' corresponds to two different person, so that I can remove those data from my dataframe for future analysis.

This is some toy data that looks like my original data:

           name t1 t2 t3
1      John Doe  1  a  a
2      Doe John  1  a  a
3    John Doe A  1  b  b
4 David Freeman  2  b  b
5 Freeman David  2  b  b
6     Jack Chen  3  b  b
7     John Chen  3  b  b
8 Chris Baker F  4  b  b
9 Baker O Chris  4  b  b
> 

And this is the desired output:

          name t1 
1     Jack Chen  3  
2     John Chen  3  
3 Chris Baker F  4  
4 Baker O Chris  4  

The following are my code and output I get:

bio_share_id <- bio %>% 
  distinct(name, t1, .keep_all = T) %>% 
  group_by(t1) %>% 
  filter(n() > 1) %>% 
  ungroup() %>%
  select(c(name,t1))

  name          t1   
  <chr>         <chr>
1 John Doe      1    
2 Doe John      1    
3 John Doe A    1    
4 David Freeman 2    
5 Freeman David 2    
6 Jack Chen     3    
7 John Chen     3    
8 Chris Baker F 4    
9 Baker O Chris 4 

As you can see, my code is reading each name as unique variable whenever there are some variations in name formats. It's reading John Doe, Doe John, and John Doe A all as different person but I want it to read them as one person.

Since the original data I have is so big, I can't fix the names manually so I want to find a way to clean those names using R. John Doe, Doe John, and John Doe A should be treated as one individual, but Chris Baker F and Bake O Chris should be treated as two different unique individual.

Thanks in advance!

Upvotes: 2

Views: 233

Answers (2)

Ashirwad
Ashirwad

Reputation: 2040

Here's my attempt:

library(tidyverse)

# Toy data
toy_data <- tribble(
  ~ name, ~ t1, ~ t2, ~ t3,
  "John Doe", 1,  "a",  "a",
  "Doe John",  1,  "a",  "a",
  "John Doe A",  1,  "b",  "b",
  "David Freeman",  2,  "b",  "b",
  "Freeman David",  2,  "b",  "b",
  "Jack Chen",  3,  "b",  "b",
  "John Chen",  3,  "b",  "b",
  "Chris Baker F",  4,  "b",  "b",
  "Baker O Chris",  4,  "b",  "b"
)

# Step 1: Nest data by t1
toy_data_nested <- toy_data %>%
  select(name, t1) %>%
  group_by(t1) %>%
  nest() %>%
  ungroup()

# Step 2: Define a function to find name differences
find_name_diff <- function(data) {
  name_pieces <- data %>%
    pull(name) %>%
    str_split(pattern = " ") %>%
    as_vector()
  
  count_by_piece <- name_pieces %>% 
    enframe() %>% 
    count(value)
  
  if (all(count_by_piece$n > 1)) { # handle cases like t1 == 2
    return("none")
  } else {
    count_by_piece %>%
      filter(n == 1) %>%
      pull(value)
  }
}

# Step 3: Get the desired output
toy_data_nested %>%
  mutate(name_diff = map(data, ~ find_name_diff(.x))) %>%
  rowwise() %>%
  mutate(diff_length = length(name_diff)) %>%
  filter(diff_length > 1) %>%
  select(data, t1) %>%
  unnest(cols = c(data))
#> # A tibble: 4 x 2
#>   name             t1
#>   <chr>         <dbl>
#> 1 Jack Chen         3
#> 2 John Chen         3
#> 3 Chris Baker F     4
#> 4 Baker O Chris     4

Created on 2020-11-22 by the reprex package (v0.3.0)

Upvotes: 2

Onyambu
Onyambu

Reputation: 79288

This is not an easy task. But you could make good use of adist function in R:

a <- sapply(strsplit(trimws(dat$name),"\\s+"), function(x)paste0(sort(x), collapse = " "))
b <- adist(a, a, partial = TRUE, ignore.case = TRUE)
dat[colSums(! b * t(b)) == 1,]
            name t1 t2 t3
6      Jack Chen  3  b  b
7      John Chen  3  b  b
8  Chris Baker F  4  b  b
9  Baker O Chris  4  b  b

Note: You might consider breaking the problem into chucks since the adist function has O(n^2) time complexity. And might not work in huge datasets

Upvotes: 2

Related Questions