Reputation: 499
I have a data frame which I want to parse based on relatives column
df <- data.frame(Name = c("abc","def","ghi","klm"),
Relatives = c("Mother & Father","Mother and brother","Husband,wife,Sister in law and Daughter","Self"))
Name Relatives
abc Mother & Father
def Mother and brother
ghi Husband,wife,Sister in law and Daughter
klm Self
Final output
Name Relatives
abc Mother
abc Father
def Mother
def brother
ghi Husband
ghi wife
ghi Sister in law
ghi daughter
klm self
Any help is useful, Thanks!
Upvotes: 0
Views: 44
Reputation: 23
Try something like this
df %>%
separate(Relatives, sep = "( & |,| and )", into = letters[1:3]) %>%
pivot_longer(2:4, values_to = "Relatives",values_drop_na = TRUE) %>%
select(Name, Relatives)
and you'll get this:
Name Relatives
<chr> <chr>
1 abc Mother
2 abc Father
3 def Mother
4 def and
5 def brother
6 ghi Husband
7 ghi wife
8 ghi and
9 klm self
Upvotes: 0
Reputation: 1137
Using tidyverse and tidytext:
library(tidyverse)
library(tidytext)
df <- data.frame(
Name = c("abc", "def", "ghi", "klm"),
Relatives = c(
"Mother & Father",
"Mother and brother",
"Husband,wife,Sister in law and Daughter",
"Self"
)
)
df %>%
mutate(across(where(is.factor), as.character)) %>%
unnest_tokens(
text,
Relatives,
token = "regex",
pattern = "\\&| and |,",
to_lower = FALSE
) %>%
mutate(text = str_trim(text))
Output:
Name text
1 abc Mother
2 abc Father
3 def Mother
4 def brother
5 ghi Husband
6 ghi wife
7 ghi Sister in law
8 ghi Daughter
9 klm Self
Upvotes: 0
Reputation: 388982
Using tidyr::separate_rows
:
library(dplyr)
library(tidyr)
df %>%
separate_rows(Relatives, sep = '&|\\band\\b|,') %>%
mutate(Relatives = trimws(Relatives))
# Name Relatives
# <chr> <chr>
#1 abc Mother
#2 abc Father
#3 def Mother
#4 def brother
#5 ghi Husband
#6 ghi wife
#7 ghi Sister in law
#8 ghi Daughter
#9 klm Self
Upvotes: 2
Reputation: 33488
Quick and dirty in base R:
tmp <- strsplit(df$Relatives, '\\&|,| and ')
data.frame(Name = rep(df$Name, lengths(tmp)), Relatives = trimws(unlist(tmp)))
# Name Relatives
# 1 abc Mother
# 2 abc Father
# 3 def Mother
# 4 def brother
# 5 ghi Husband
# 6 ghi wife
# 7 ghi Sister in law
# 8 ghi Daughter
# 9 klm Self
Same logic with data.table
:
library(data.table)
setDT(df)
df[, strsplit(Relatives, '\\&|,| and '), by = Name
][, Relatives := trimws(V1)][, V1 := NULL][]
Upvotes: 3