nickolakis

Reputation: 621

Text subsetting of a data frame in R

I have two single-column data frames of given names in R:

A <- data.frame(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B  <- data.frame(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

I want to compare the two and create a third object, C, containing the names in B that are not in A. The comparison should ignore case, i.e. treat James and james as the same name. If an entry lists two names (given name and preferred name), e.g. Lucas; Luc, a match on either name should count as a match. In the end, the result must be:

C <- data.frame(c("Evelyn; Eva", "Harper","Amelia")) 

Can someone help me?

Upvotes: 0

Views: 45

Answers (3)

RYann

Reputation: 683

Probably the ugliest code I've written, but it works.

library(tidyverse)  # stringr, tibble and dplyr functions are used below

A <- str_to_title(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B  <- str_to_title(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

# Long version if you wish:
# Split each entry into its given name and the optional preferred name after ";".
nested <- tibble(given=str_extract(c(A,B),"^[^;]+"),
           preferred=str_extract(c(A,B),";\\s*([^;]+)") %>% str_extract("[a-zA-Z]+"),
           list=c(rep("A",length(A)),rep("B",length(B)))) %>% nest_by(list)
A <- nested$data[[1]]
B <- nested$data[[2]]
# TRUE where a given name in B also appears in A, as a given or a preferred name.
in_a <- B$given %in% A$given | B$given %in% A$preferred

# Keep the unmatched rows of B and rebuild the "given; preferred" label.
B %>% filter(given %in% B$given[!in_a]) %>%
  mutate(c=ifelse(is.na(preferred),given,str_c(given,preferred,sep  = "; "))
) %>% pull(c)

Upvotes: 1

RYann

Reputation: 683

I'm also not sure if I understood you correctly.

library(stringr)
A <- str_to_title(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B  <- str_to_title(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))
# Keep the elements of B that do not occur in A.
c <- B[!B %in% A]
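
A plain vector comparison like the one above treats a compound entry such as "Lucas; Luc" as a single string, so "Luc" in B is not recognised as already present in A. One possible way around that, sketched here in base R, is to split every entry on ";" before comparing. The helper name split_names is made up for this illustration, and the sketch assumes ";" is the only separator used.

```r
# Hypothetical helper: split "given; preferred" entries into lower-cased parts.
split_names <- function(x) {
  lapply(strsplit(tolower(x), ";", fixed = TRUE), trimws)
}

A <- c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc")
B <- c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia")

# Every name variant that occurs anywhere in A.
known <- unlist(split_names(A))

# An entry of B counts as present in A if any of its variants is known.
in_A <- vapply(split_names(B), function(parts) any(parts %in% known), logical(1))

C <- B[!in_A]
C
# "Evelyn; Eva" "Harper" "Amelia"
```

Because the original strings in B are only indexed, never rewritten, the result keeps the spelling and capitalisation of B.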

Upvotes: 0

M.Viking

Reputation: 5398

I don't understand what you mean by names appearing as part of a double name. However, the bulk of your question can be answered with:

  • setdiff() to identify the differences between two sets (available in base R; dplyr also provides a version), and
  • the base toupper() to convert all strings to upper case (or tolower() for the reverse)
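
As a small illustration of these two functions, the sketch below uses two made-up toy vectors (a and b are not from the question) to show that the argument order of setdiff() matters and that the comparison is case-sensitive unless the strings are normalised first.

```r
# Toy vectors, invented for this illustration only.
a <- c("Liam", "james")
b <- c("James", "Harper")

# After lower-casing, "james" is shared, so each direction leaves one name.
only_in_a <- setdiff(tolower(a), tolower(b))  # "liam"
only_in_b <- setdiff(tolower(b), tolower(a))  # "harper"

# Without case normalisation, "James" and "james" count as different names.
case_sensitive <- setdiff(b, a)               # "James" "Harper"
```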

We also need to name the column in each data frame. I called both "x".

A <- data.frame(x=c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B <- data.frame(x=c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

library(tidyverse)
?union
union(setdiff(toupper(A$x), toupper(B$x)), setdiff(toupper(B$x), toupper(A$x)))

#[1] "NICK"        "MARIA"       "OLIVER"      "SOPHIA"      "LUCAS; LUC"  "LUC"         "EVELYN; EVA" "HARPER"      "AMELIA"     

Note that setdiff() is asymmetric, so we take the elements of A that are not in B and combine them with the elements of B that are not in A.
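
If only one direction is wanted (names in B that are not in A, as the question asks), a single comparison may be enough. The sketch below is one way to do that while keeping the original spelling from B. It assumes the column is called "x" as above, and it does not split compound entries, so "Luc" is still reported.

```r
A <- data.frame(x = c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B <- data.frame(x = c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

# Keep the rows of B whose upper-cased name does not occur in upper-cased A.
# drop = FALSE keeps the result a data frame even though it has one column.
C <- B[!toupper(B$x) %in% toupper(A$x), , drop = FALSE]
C$x
# "Luc" "Evelyn; Eva" "Harper" "Amelia"
```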

More info: see the question "Unexpected behavior for setdiff() function in R".

Upvotes: 0
