Reed

Reputation: 308

Write a function to get later elements from str_split()

I've recently received a raft of data sets with multiple pieces of data in a single column and, as the title suggests, I'm trying to write a function that returns some of the later split elements. Hunting around, I've seen solutions for getting just the first element or just the last, but not for selecting which elements are returned. This looks like a persistent issue in these data sets, so a solution that I can abstract would be delightful.

Example: Ideally this function would return just the binomial names of these organisms, but I don't want it anchored to the end of the string, as sometimes there is more unneeded information after the names.

library(tidyverse)

foo <-  data.frame(id = paste0("a", 1:6),
                      Organisms = c("EA - Enterobacter aerogenes",  "EA - Enterobacter aerogenes",
                                    "KP - Klebsiella pneumoniae", "ACBA - Acinetobacter baumannii",
                                    "ENC - Enterobacter cloacae", "KP - Klebsiella pneumoniae")) 
## Just the first element (does not allow you to select 2 elements)
Orgsplit_abrev <- function(x){
  sapply(str_split(x," "), getElement, 1)
}

foo %>%
  summarise(Orgsplit_abrev(Organisms))


str_split(foo$Organisms, " ")[[1]][c(3,4)]

Upvotes: 1

Views: 107

Answers (2)

LostInTranslation

Reputation: 159

Why don't you split using the "-" delimiter?

> str_split(foo$Organisms, "-") %>% do.call('rbind', .)
     [,1]    [,2]                      
[1,] "EA "   " Enterobacter aerogenes" 
[2,] "EA "   " Enterobacter aerogenes" 
[3,] "KP "   " Klebsiella pneumoniae"  
[4,] "ACBA " " Acinetobacter baumannii"
[5,] "ENC "  " Enterobacter cloacae"   
[6,] "KP "   " Klebsiella pneumoniae"
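
As the output above shows, splitting on "-" alone leaves stray spaces around each piece. A minimal sketch of one way to avoid that, assuming the separator is always the literal " - " and using stringr's str_split_fixed() to split at most once (the vector name organisms is only for illustration):

```r
library(stringr)

organisms <- c("EA - Enterobacter aerogenes", "ACBA - Acinetobacter baumannii")

# Split at most once on the literal " - " so no stray spaces are left behind
parts <- str_split_fixed(organisms, fixed(" - "), n = 2)

# Column 1 holds the abbreviation, column 2 holds everything after the separator
binomial <- parts[, 2]
binomial
```

This keeps anything that follows the binomial name too, so it only fits if the trailing information is wanted or absent.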

tail is also a good idea, but I would use -2 instead of 2 to keep everything except the first two elements (so that messier names are still included in full):

Orgsplit_abrev <- function(x){
  lapply(str_split(x," "), tail, -2)
}

or with an anonymous function, dropping the first two elements by negative indexing

Orgsplit_abrev <- function(x){
  lapply(str_split(x," "), function(x) x[-(1:2)])
}

Upvotes: 0

akrun

Reputation: 887088

We may use tail. As more than one element is returned per string, the result is a list column:

Orgsplit_abrev <- function(x){
  lapply(str_split(x," "), tail, 2)
}

Testing:

foo %>%
   summarise(Orgsplit_abrev(Organisms))
Orgsplit_abrev(Organisms)
1   Enterobacter, aerogenes
2   Enterobacter, aerogenes
3    Klebsiella, pneumoniae
4  Acinetobacter, baumannii
5     Enterobacter, cloacae
6    Klebsiella, pneumoniae

Also, if we want to specify the index, we can use an anonymous function:

Orgsplit_abrev <- function(x){
  lapply(str_split(x," "), function(x) x[c(3, 4)])
}

Or we may pass the extraction operator `[` directly:

Orgsplit_abrev <- function(x){
   lapply(str_split(x," "),`[`, c(3, 4))
 }

Upvotes: 3
