user9292

Reputation: 1145

Extract the first two characters from each name in a list of names in R

The data frame df1 contains two columns, id and list_names:

id <- seq(1,5)
list_names <- c("john", 
                "adam, sally", 
                "rebecca", 
                "zhang, mike, antonio", 
                "mark, henry, scott, john, steve, jason, nancy")

df1 <- data.frame(id, list_names)

I need to add an additional column that contains the first two characters extracted from every name.

In the new data set, the row "adam, sally", for example, would get "ad, sa" in the added column.

Note that the number of names per row varies and is not known in advance, so it should not have to be specified.

Upvotes: 2

Views: 611

Answers (2)

Andrew Haynes

Reputation: 2640

In a for loop, split each observation on ', ' with strsplit(), take the first two characters of each piece with substr(), and then paste the pieces back together:

for(g in df1$list_names){
  print(
   paste(substr(unlist(strsplit(g, ', ')),1,2), collapse = ', ')
  )
}

[1] "jo"
[1] "ad, sa"
[1] "re"
[1] "zh, mi, an"
[1] "ma, he, sc, jo, st, ja, na"

Or you can do this in one line with sapply(), assigning the result to a new column:

df1$new_list_names = sapply(df1$list_names, function(g) paste(substr(unlist(strsplit(as.character(g), ', ')),1,2), collapse = ', '))

> df1
  id                                    list_names             new_list_names
1  1                                          john                         jo
2  2                                   adam, sally                     ad, sa
3  3                                       rebecca                         re
4  4                          zhang, mike, antonio                 zh, mi, an
5  5 mark, henry, scott, john, steve, jason, nancy ma, he, sc, jo, st, ja, na
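If the same transformation is needed in more than one place, the split / substr / paste steps could be wrapped in a small helper. This is only a sketch: the name first_two and its n argument are made up for illustration, and as.character() is there on the assumption that list_names may be a factor (the data.frame() default before R 4.0).

```r
# Hypothetical helper (not part of the answer above): keeps the first n
# characters of every comma-separated name in each element of x.
first_two <- function(x, n = 2) {
  # Split on commas, swallowing any whitespace that follows them
  parts <- strsplit(as.character(x), ",\\s*")
  # vapply() guarantees one string per row and returns an unnamed vector
  vapply(parts,
         function(p) paste(substr(p, 1, n), collapse = ", "),
         character(1))
}

id <- seq(1, 5)
list_names <- c("john",
                "adam, sally",
                "rebecca",
                "zhang, mike, antonio",
                "mark, henry, scott, john, steve, jason, nancy")
df1 <- data.frame(id, list_names)

df1$new_list_names <- first_two(df1$list_names)
```

Using vapply() instead of sapply() is a design choice here: it fails loudly if any row does not produce exactly one string, and it does not attach the input strings as names.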

Upvotes: 2

akrun

Reputation: 887621

We can use str_extract_all to extract the two lowercase letters that follow each word boundary, then collapse the matches for each row with toString:

library(stringr)
library(dplyr)
library(purrr)
df1 %>%
     mutate(two_chars = str_extract_all(list_names, "\\b[a-z]{2}")  %>%
                            map_chr(toString))
#  id                                    list_names                  two_chars
#1  1                                          john                         jo
#2  2                                   adam, sally                     ad, sa
#3  3                                       rebecca                         re
#4  4                          zhang, mike, antonio                 zh, mi, an
#5  5 mark, henry, scott, john, steve, jason, nancy ma, he, sc, jo, st, ja, na
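If installing packages is not an option, roughly the same extraction can be sketched in base R with gregexpr() and regmatches(). The variable names hits and two_chars are just placeholders, and perl = TRUE is my own choice here so that \\b is handled by the PCRE engine.

```r
list_names <- c("john",
                "adam, sally",
                "rebecca",
                "zhang, mike, antonio",
                "mark, henry, scott, john, steve, jason, nancy")

# gregexpr() locates every match per string; regmatches() pulls them out
# as a list with one character vector per input string
hits <- regmatches(list_names,
                   gregexpr("\\b[a-z]{2}", list_names, perl = TRUE))

# Collapse each vector of matches into a single comma-separated string
two_chars <- vapply(hits, toString, character(1))
```

The result can then be assigned as a column, e.g. df1$two_chars <- two_chars.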

Or using gsub, which keeps the first two letters of each word and drops the rest of it up to the next comma:

gsub("\\b([a-z]{2})[^,]+", "\\1", df1$list_names)
#[1] "jo"                         "ad, sa"                     "re"                         "zh, mi, an"                
#[5] "ma, he, sc, jo, st, ja, na"
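Note that [a-z] only matches lowercase letters, which is fine for the example data. If the names could be capitalized, one possible variant (an assumption about the real data, not something stated in the question) is to swap in the POSIX class [[:alpha:]]. The vector mixed below is invented for illustration, and perl = TRUE is my addition.

```r
# Made-up input with capitalized names, for illustration only
mixed <- c("John, adam", "Zhang, Mike, antonio")

# [[:alpha:]] accepts upper- and lowercase letters; the rest of the
# pattern is the same as in the gsub call above
res <- gsub("\\b([[:alpha:]]{2})[^,]+", "\\1", mixed, perl = TRUE)
res
# [1] "Jo, ad"     "Zh, Mi, an"
```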

Upvotes: 3
