user2991591

Reputation: 47

Select columns in a data frame by partial column-name matching in R

I would like to subset my data frame by selecting columns whose names partially match a string. This works when I have a single "name" to match. The data frame is:

         ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B 
    1908      NA      NA      NA      NA      NA      NA      NA      NA           
    1909      NA      NA      NA      NA      NA      NA      NA      NA          
    1910      NA      NA      NA      NA      NA      NA      NA      NA         
    1911      NA      NA      NA      NA      NA      NA      NA      NA      
    1912      NA      NA      NA      NA      NA      NA      NA      NA      
    1913      NA      NA      NA      NA      NA      NA      NA      NA      

    library(stringr)
    df[str_detect(names(df), "ABBA" )]

works, and returns:

         ABBA01A ABBA01B ABBA02A ABBA02B 
    1908      NA      NA      NA      NA    

So, I would like to create a dataframe for each of my species:

    Speciesnames=unique ( substring (names(df),0, 4))
    Speciesnames
     [1] "ABBA" "ACRU" "ARCU" "PIAB" "PIGL" 

I have tried to write a loop and use [i] as the species name, but the str_detect function does not recognise it. I would also like to add additional calculations in the loop.

    for ( i in seq_along(Speciesnames)){

      df=df[str_detect(names(df), pattern =[i])]

      print(df)
     #my function for the subsetted dataframe
    }

Thank you for your help!

Upvotes: 1

Views: 290

Answers (3)

MKR

Reputation: 20085

One option is to use mapply with SIMPLIFY=FALSE to return a list of data frames, one for each species. The startsWith function from base R lets you subset the columns whose names start with a species name.

    # First find the species by taking the unique first 4 characters of the column names
    species <- unique(gsub("([A-Z]{4}).*", "\\1", names(df)))

    # Pass each species to the subsetting function
    listOfDFs <- mapply(function(x){
      # Return only columns starting with the species name;
      # drop = FALSE keeps a data frame even when only one column matches
      df[, startsWith(names(df), x), drop = FALSE]
    }, species, SIMPLIFY=FALSE)

    listOfDFs
    # $ABBA
    #      ABBA01A ABBA01B ABBA02A ABBA02B
    # 1908      NA      NA      NA      NA
    # 1909      NA      NA      NA      NA
    # 1910      NA      NA      NA      NA
    # 1911      NA      NA      NA      NA
    # 1912      NA      NA      NA      NA
    # 1913      NA      NA      NA      NA
    #
    # $ACRU
    #      ACRU01A ACRU01B ACRU02A ACRU02B
    # 1908      NA      NA      NA      NA
    # 1909      NA      NA      NA      NA
    # 1910      NA      NA      NA      NA
    # 1911      NA      NA      NA      NA
    # 1912      NA      NA      NA      NA
    # 1913      NA      NA      NA      NA

Data:

    df <- read.table(text =
    "ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
    1908      NA      NA      NA      NA      NA      NA      NA      NA
    1909      NA      NA      NA      NA      NA      NA      NA      NA
    1910      NA      NA      NA      NA      NA      NA      NA      NA
    1911      NA      NA      NA      NA      NA      NA      NA      NA
    1912      NA      NA      NA      NA      NA      NA      NA      NA
    1913      NA      NA      NA      NA      NA      NA      NA      NA",
    header = TRUE, stringsAsFactors = FALSE)
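
As a possible alternative to the mapply call, base R's split.default can group the columns by species in one step, because it treats a data frame as a list of columns. This is only a sketch: the toy data frame and the name species_of_column are made up for illustration, and it assumes the species code is always the first four characters of the column name.

```r
# Toy data standing in for the real data frame (values are made up).
df <- data.frame(
  ABBA01A = c(1, 2), ABBA01B = c(3, 4),
  ACRU01A = c(5, 6), ACRU01B = c(7, 8),
  row.names = c("1908", "1909")
)

# Species code = first four characters of each column name (assumption).
species_of_column <- substr(names(df), 1, 4)

# split.default() splits the columns (not the rows) by species code and
# returns a named list with one data frame per species.
by_species <- split.default(df, species_of_column)

names(by_species)
by_species$ACRU
```

The result should have the same shape as listOfDFs: a named list that you can index by species, for example by_species$ABBA.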

Upvotes: 1

phiver

Reputation: 23598

Using your data you could do the following:

  1. Create a list to hold the data.frames to be created.
  2. Filter the columns for each species and store the resulting data.frame in the list.
  3. Give each data.frame the name of its species.
  4. Bring all the data.frames out of the list and into the global environment.

    library(dplyr)  # provides %>%, select() and starts_with()

    Speciesnames <- unique(substring(names(df), 1, 4))
    
    data <- vector("list", length(Speciesnames))
    
    for(i in seq_along(Speciesnames)) {
      data[[i]] <- df %>% select(starts_with(Speciesnames[i]))
    }
    names(data) <- Speciesnames
    
    list2env(data, envir = globalenv())

The end result after list2env is two data.frames called "ABBA" and "ACRU", which you can then access directly. If further manipulation is needed, you might leave everything in the list and do it there.
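
To illustrate the "do it there" option, here is a minimal sketch that keeps the per-species data frames in a list and runs a calculation on each one with lapply. The toy data, the names by_species and species_means, and the choice of rowMeans as the calculation are all assumptions for illustration; swap in whatever function you actually need.

```r
# Toy data standing in for the real data frame (values are made up).
df <- data.frame(
  ABBA01A = c(1, 2), ABBA01B = c(3, NA),
  ACRU01A = c(5, 6),
  row.names = c("1908", "1909")
)

species <- unique(substr(names(df), 1, 4))

# One data frame per species, kept together in a named list.
by_species <- lapply(setNames(species, species), function(sp) {
  df[, startsWith(names(df), sp), drop = FALSE]
})

# Example calculation (hypothetical): yearly mean across each species' columns,
# ignoring missing values.
species_means <- lapply(by_species, function(d) rowMeans(d, na.rm = TRUE))

species_means$ABBA
```

Working on the list avoids creating one global variable per species, which tends to be easier to manage when the number of species grows.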

Upvotes: 1

Abderyt

Reputation: 109

I think that you should collect all matching column names first, and then subset your data.frame.

    patterns <- c("ABB", "CDC")
    res <- lapply(patterns, function(x) grep(x, colnames(df), value=TRUE))
    df[, unique(unlist(res))]

The res object is a list of the matched column names for each pattern.

The next step is to take the unique set of column names, unique(unlist(res)), and use it to subset the data.frame.

If you are writing production code, this is probably not the best approach.
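
One possible variation is to combine the patterns into a single regular expression and call grepl once. This is a sketch on made-up data: the name combined and the pattern values are assumptions, and unlike the grep call above it anchors the match at the start of the column name with ^.

```r
# Toy data standing in for the real data frame (values are made up).
df <- data.frame(ABBA01A = 1:2, ABBA01B = 3:4, ACRU01A = 5:6, PIAB01A = 7:8)

patterns <- c("ABBA", "PIAB")  # hypothetical species codes to keep

# Build "^(ABBA|PIAB)": match any of the patterns at the start of the name.
combined <- paste0("^(", paste(patterns, collapse = "|"), ")")

# drop = FALSE keeps a data frame even if only one column matches.
selected <- df[, grepl(combined, names(df)), drop = FALSE]

names(selected)
```

Note that this keeps the columns in their original order in df, whereas unique(unlist(res)) orders them by pattern.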

Upvotes: 0
