Reputation: 513
I'm working on a data.frame and trying to extract the part of a string between , and . and put it into a new column. I would like to use dplyr.
library(dplyr)
name <- c("Cumings, Mrs. John Bradley","Heikkinen, Miss. Laina","Moran, Mr. James","Allen, Mr. William Henry","Futrelle, Mrs. Jacques Heath (Lily May Peel)")
sex <- c("female","female","male","male","female")
age <- c(22,23,24,37,42)
data <- data.frame(name,sex,age)
So I want to extract Mrs, Miss, Mr and so on into their own column.
data %>%
mutate(title = strsplit(name, split = "[,.]")) %>%
select(name,title)
Upvotes: 1
Views: 1218
Reputation:
I guess something like this: data %>% mutate(title = gsub(".*, |\\..*", "", name))
Upvotes: 0
Reputation: 18661
Similar to @Benjamin's answer (base R's equivalent to str_extract_all), here's how to do it using regmatches + gregexpr + a positive lookahead:
library(dplyr)
data %>%
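# refer to the column as `name` inside mutate() rather than data$name;
# gregexpr() returns a list, so title is a list column
mutate(title = regmatches(name, gregexpr("\\b[[:alpha:]]+(?=[.])",
name, perl = TRUE))) %>%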
select(name,title)
Result:
                                          name title
1                   Cumings, Mrs. John Bradley   Mrs
2                       Heikkinen, Miss. Laina  Miss
3                             Moran, Mr. James    Mr
4                     Allen, Mr. William Henry    Mr
5 Futrelle, Mrs. Jacques Heath (Lily May Peel)   Mrs
\\b matches a "word boundary", which in this case is the position between the space and the first letter of the title. perl = TRUE is needed to use the positive lookahead (?=[.]), which essentially says "only if the pattern is followed by a .", without including the . in the match.
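Because gregexpr returns a list, the title column above is a list column. If a plain character column is preferred, one possible sketch is to keep only the first match per name. The sample vector and the vapply helper here are illustrative assumptions, not part of the answer above.

```r
# Illustrative sample; the last entry has no title at all.
name <- c("Cumings, Mrs. John Bradley", "Moran, Mr. James", "Madonna")

# One list element per name, holding every word that is followed by a period.
matches <- regmatches(name, gregexpr("\\b[[:alpha:]]+(?=[.])", name, perl = TRUE))

# Keep the first match per name; names without a match become NA.
title <- vapply(
  matches,
  function(m) if (length(m) > 0) m[1] else NA_character_,
  character(1)
)

print(title)
```

The same vapply call could be wrapped around the regmatches expression inside mutate() if the result should stay in a dplyr pipeline.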
Upvotes: 1
Reputation: 886998
Without using any external package (note that the trailing period is kept in the result):
data$title <- with(data, sub("^[^,]+,\\s*(\\S+).*", "\\1", name))
data$title
#[1] "Mrs." "Miss." "Mr." "Mr." "Mrs."
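If the trailing period is not wanted, a small variation on the same idea is to capture only the characters before the first period. This is a sketch under the assumption that every name follows the "Surname, Title. Given names" layout; sub returns the input unchanged when the pattern does not match.

```r
# Illustrative sample in the layout used in the question.
name <- c("Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina", "Moran, Mr. James")

# Skip the surname and comma, then capture everything up to (not including) the first period.
title <- sub("^[^,]+,\\s*([^.]+)\\..*", "\\1", name)

print(title)
```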
Upvotes: 1
Reputation: 17369
str_extract will retrieve the first instance within each string:
library(dplyr)
library(stringr)
data <- data.frame(name,sex,age) %>%
mutate(title = str_extract(name, ",.+\\."),
title = str_replace_all(title, "([[:punct:]]| )", ""))
A slightly more efficient solution:
data %>%
mutate(title = str_trim(str_extract(name, regex("(?<=,).*?(?=\\.)"))))
The (?<=,) says to start right after a comma, the (?=\\.) says to stop right before a period, and the lazy .*? grabs everything in between, up to the first period. str_trim then removes the leading and trailing white space.
Upvotes: 2
Reputation: 28329
Delete everything outside the , and . using gsub(".*, |\\..*", "", name):
library(dplyr)
data %>% mutate(title = gsub(".*, |\\..*", "", name))
gsub(".*, ", "", name): deletes everything before the comma, the comma itself, and the space after it.
gsub("\\..*", "", name): deletes the period and everything after it.
|: combines the two patterns into a single gsub call.
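To make the breakdown concrete, here is a small sketch that applies the two patterns one after the other and compares the result with the combined call. The shortened sample vector and the intermediate variable names are only for illustration.

```r
# Illustrative sample in the layout used in the question.
name <- c("Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina", "Moran, Mr. James")

# Step 1: drop everything up to and including ", ".
after_comma <- gsub(".*, ", "", name)

# Step 2: drop the first period and everything after it.
two_step <- gsub("\\..*", "", after_comma)

# Same thing in one call, joining both patterns with "|".
one_step <- gsub(".*, |\\..*", "", name)

print(two_step)
print(identical(two_step, one_step))
```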
Upvotes: 2
Reputation: 359
I don't have an answer to the dplyr problem.
I just wanted to mention that splitting the salutation from a name this way will probably run into multiple errors on real-world data.
A better (but still error-prone) way to do this is to build a lookup table of common salutations and match against it with a regex.
The advantage over splitting the string is that when the regex finds no hit, the value stays empty (NA) and can easily be fixed manually, instead of introducing inconsistent data in the first step.
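A minimal sketch of that idea in base R might look like the following. The list of salutations and the extract_title helper are hypothetical and would need to be adapted to the actual data.

```r
# Hypothetical lookup table of known salutations; extend it for your own data.
known_titles <- c("Mr", "Mrs", "Miss", "Ms", "Master", "Dr", "Rev")

# One pattern: ", " followed by a known salutation and a literal period.
pattern <- paste0(", (", paste(known_titles, collapse = "|"), ")\\.")

extract_title <- function(x) {
  x <- as.character(x)
  hit <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the matched elements, so fill them in by index
  # and strip the leading ", " and the trailing period.
  out[hit > 0] <- gsub("^, |\\.$", "", regmatches(x, hit))
  out
}

name <- c("Moran, Mr. James", "Heikkinen, Miss. Laina", "Madonna", "Smith, Capt. Edward")
print(extract_title(name))
```

Names without a known salutation (here the last two) come back as NA, so they can be reviewed by hand. In a dplyr pipeline the helper could be used as mutate(title = extract_title(name)).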
Upvotes: 1