Reputation: 1
I a fresh R user and I can't find how to properly spit my data into 5 columns (name, surname, title, area_code and phone_number).
df=read.table("school.txt")
df <- data.frame(
stringsAsFactors = FALSE,
V1= c("Lebel, MarieStudent 1st year216 132-3789",
"Lachance, PaulTeacher 2nd year567 990-345 ext 1811",
"Smith, AnnieStudent 1st yearNot available")
I was able to separate the data into 2 columns to get the names by doing this:
df1= data.frame(str_split_fixed(df$V1, ",", 2))
Thank you in advance
Upvotes: 0
Views: 34
Reputation: 388797
You can use regex to separate out the data into different columns. Using tidyr::extract
:
tidyr::extract(df, V1,
c("surname", "name", "title", "year","area_code", "phone_number"),
'(\\w+),\\s([A-Za-z]+)(Teacher|Student)\\s(\\w+\\syear)(\\d+)?\\s?(.*)?')
# surname name title year area_code phone_number
#1 Lebel Marie Student 1st year 216 132-3789
#2 Lachance Paul Teacher 2nd year 567 990-345 ext 1811
#3 Smith Annie Student 1st year Not available
Upvotes: 1
Reputation: 740
Most likely possible to do it in fewer steps...
df %>%
separate(V1,into=c("name","step1"),sep=",") %>%
mutate(title = case_when(str_detect(step1,pattern = "Student") ~ "Student",
str_detect(step1,pattern = "Teacher") ~ "Teacher",
TRUE ~ NA_character_
)) %>%
mutate(step2 = str_replace(step1,title,replacement = "")) %>%
separate(step2,into=c("surname","step3"),"[0-9]+(st|nd|rd|th)+",remove = FALSE) %>%
mutate(step3 = str_replace(step2,surname,"")) %>%
mutate(year = str_extract(step3,"[0-9](st|nd|rd|th) year")) %>%
mutate(step4 = str_replace(step3,year,"")) %>%
mutate(area_code = str_extract(step4,"[0-9]+\\s")) %>%
mutate(phone_number = str_replace(step4,area_code,"")) %>%
dplyr::select(-step1,-step2,-step3,-step4)
Upvotes: 0