Dan

Reputation: 513

Getting a split out of a string into a new column

I'm working with a data.frame and trying to extract the part of a string between , and . and put it into a new column. I would like to use dplyr.

library(dplyr)

name <- c("Cumings, Mrs. John Bradley","Heikkinen, Miss. Laina","Moran, Mr. James","Allen, Mr. William Henry","Futrelle, Mrs. Jacques Heath (Lily May Peel)")
sex <- c("female","female","male","male","female")
age <- c(22,23,24,37,42)
data <- data.frame(name,sex,age)

So I want to extract Mrs, Miss, Mr and so on into their own column.

data %>%
  mutate(title = strsplit(name, split = "[,.]")) %>%
  select(name,title)

Upvotes: 1

Views: 1218

Answers (6)

user9039365

Reputation:

I guess something like this: data %>% mutate(title = gsub(".*, |\\..*", "", name))

Upvotes: 0

acylam

Reputation: 18661

Similar to @Benjamin's answer (Base R's equivalent to str_extract_all), here's how to do it using regmatches + gregexpr + positive lookahead:

library(dplyr)
data %>%
  mutate(title = regmatches(data$name, gregexpr("\\b[[:alpha:]]+(?=[.])", 
                                                data$name, perl = TRUE))) %>%
  select(name,title)

Result:

                                          name title
1                   Cumings, Mrs. John Bradley   Mrs
2                       Heikkinen, Miss. Laina  Miss
3                             Moran, Mr. James    Mr
4                     Allen, Mr. William Henry    Mr
5 Futrelle, Mrs. Jacques Heath (Lily May Peel)   Mrs

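\\b matches a "word boundary", a zero-width position between a word character and a non-word character; here it anchors the match at the start of the title, right after the space. perl = TRUE is needed to use the positive lookahead (?=[.]), which essentially says "match only if the pattern is followed by a ."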
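
Note that gregexpr returns every match per string, so regmatches gives back a list and the title column above is a list column rather than a character vector. A minimal base R sketch of flattening it, assuming only the first match is wanted and that NA is acceptable for names without a title (the "Smith, John" entry is a made-up example, not from the question):

```r
# Names from the question plus one hypothetical entry without a title
name <- c("Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina",
          "Moran, Mr. James", "Smith, John")

# One list element per name, each holding all matches for that name
matches <- regmatches(name, gregexpr("\\b[[:alpha:]]+(?=[.])", name, perl = TRUE))

# Keep the first match, or NA when nothing matched
title <- vapply(matches,
                function(m) if (length(m) > 0) m[1] else NA_character_,
                character(1))
title
```

The resulting title is a plain character vector, so it can be assigned inside mutate() like any other column.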

Upvotes: 1

akrun

Reputation: 886998

Without using any external package

data$title <- with(data, sub("^[^,]+,\\s*(\\S+).*", "\\1", name))
data$title
#[1] "Mrs."  "Miss." "Mr."   "Mr."   "Mrs." 

Upvotes: 1

Benjamin

Reputation: 17369

str_extract will retrieve the first instance within each string:

library(dplyr)
library(stringr)

data <- data.frame(name,sex,age) %>% 
  mutate(title = str_extract(name, ",.+\\."),
         title = str_replace_all(title, "([[:punct:]]| )", ""))

A slightly more efficient solution:

data %>%
  mutate(title = str_trim(str_extract(name, regex("(?<=,).*?(?=\\.)"))))

The (?<=,) is a lookbehind that says to start right after a comma, the (?=\\.) is a lookahead that says to stop right before the period, and the .*? lazily grabs everything in between. str_trim then removes the leading and trailing whitespace.

Upvotes: 2

pogibas

Reputation: 28329

Delete everything outside the part between , and . using gsub(".*, |\\..*", "", name):

library(dplyr)
data %>% mutate(title = gsub(".*, |\\..*", "", name))

gsub(".*, ", "", name) deletes everything before the comma, the comma itself, and the space after it. gsub("\\..*", "", name) deletes the . and everything after it.
| combines the two patterns into a single gsub call.
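
To see that the alternation behaves like the two separate calls, the steps can be run one at a time. A small sketch using names from the question:

```r
name <- c("Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina",
          "Moran, Mr. James", "Allen, Mr. William Henry",
          "Futrelle, Mrs. Jacques Heath (Lily May Peel)")

# Step 1: drop everything up to and including ", "
step1 <- gsub(".*, ", "", name)

# Step 2: drop the first "." and everything after it
step2 <- gsub("\\..*", "", step1)

# Both steps in one call, joined with |
combined <- gsub(".*, |\\..*", "", name)

identical(step2, combined)
```

Because .* is greedy, a name containing more than one ", " would be cut at the last one, so this relies on each name having a single comma before the title.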

Upvotes: 2

Christian

Reputation: 359

I don't have an answer to the dplyr problem.

I just wanted to mention that splitting the salutation from a name this way will probably run into multiple errors with real-world data.

A better (but still error-prone) way is to create a lookup table of common salutations and match against it with a regex.

The advantage over splitting is that if the regex finds no hit, the value stays empty (NA) and can easily be fixed manually, instead of creating inconsistent data in the first step.
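
A rough base R sketch of that idea. The known_titles vector and the extract_title helper are hypothetical names chosen for illustration, and the list of salutations is an assumption that would need extending for real data:

```r
# Hypothetical lookup table of accepted salutations (extend as needed)
known_titles <- c("Mr", "Mrs", "Miss", "Ms", "Master", "Dr", "Rev")

# Match ", <title>." and capture only the title
pattern <- paste0(",\\s*(", paste(known_titles, collapse = "|"), ")\\.")

# Hypothetical helper: returns the captured title, or NA when no known title is found
extract_title <- function(x) {
  x <- as.character(x)
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

extract_title(c("Cumings, Mrs. John Bradley",
                "Moran, Mr. James",
                "Oliva y Ocana, Dona. Fermina"))
```

The third name carries a salutation that is not in the table, so it comes back as NA instead of an unchecked string. With dplyr this could be used as data %>% mutate(title = extract_title(name)).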

Upvotes: 1
