missy morrow

Reputation: 337

Subset column names with specific string

I am trying to subset a data frame based on column names that start with a particular string. Some of my columns are named ABC_1, ABC_2, ABC_3 and others are named ABC_XYZ_1, ABC_XYZ_2, ABC_XYZ_3.

How can I subset my dataframe such that it contains only ABC_1, ABC_2, ABC_3 ...ABC_n columns and not the ABC_XYZ_1, ABC_XYZ_2...?

I have tried this option

set.seed(1)
# Example data: two ABC_<n> columns and two ABC_XYZ_<n> columns
df <- data.frame( ABC_1 = sample(0:1, 3, replace = TRUE),
            ABC_2 = sample(0:1, 3, replace = TRUE),
            ABC_XYZ_1 = sample(0:1, 3, replace = TRUE),
            ABC_XYZ_2 = sample(0:1, 3, replace = TRUE) )


df1 <- df[ , grepl( "ABC" , names( df ) ) ]

ind <- apply( df1 , 1 , function(x) any( x > 0 ) )

df1[ ind , ]

but this returns both the ABC_1 ... ABC_n columns and the ABC_XYZ_1 ... ABC_XYZ_n columns. I am not interested in the ABC_XYZ columns, only in ABC_1, ABC_2, and so on. Any suggestion is much appreciated.

Upvotes: 1

Views: 3380

Answers (2)

Jota

Reputation: 17611

To specify "ABC_" followed by one or more digits (i.e. \\d+ or [0-9]+), you can use

df1 <- df[ , grepl("ABC_\\d+", names( df ), perl = TRUE ) ]
# df1 <- df[ , grepl("ABC_[0-9]+", names( df ), perl = TRUE ) ] # another option

To force the column names to start with "ABC_" you can add ^ to the regex to match only when "ABC_\d+" occurs at the start of the string as opposed to occurring anywhere within it.

df1 <- df[ , grepl("^ABC_\\d+", names( df ), perl = TRUE ) ]
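As a small sketch of how the anchors behave, the example below runs the pattern against a few column names. The name `ABC_1_old` and the simplified data frame are hypothetical, added only for illustration. It also assumes you may want `$` at the end of the pattern (so nothing may follow the digits) and `drop = FALSE` (so the result stays a data frame even if only one column matches); neither is required for the data in the question.

```r
# Hypothetical column names modelled on the question, plus one extra case
nms <- c("ABC_1", "ABC_2", "ABC_XYZ_1", "ABC_XYZ_2", "ABC_1_old")

# Anchored at the start only: a suffix after the digits still matches
start_only <- grepl("^ABC_\\d+", nms)

# Anchored at both ends: only names of the form ABC_<digits> match
keep <- grepl("^ABC_[0-9]+$", nms)

# Simplified stand-in for the question's data frame
df <- data.frame(ABC_1 = 1:3, ABC_2 = 4:6, ABC_XYZ_1 = 7:9, ABC_XYZ_2 = 10:12)

# drop = FALSE keeps a data frame even when a single column is selected
df1 <- df[, grepl("^ABC_[0-9]+$", names(df)), drop = FALSE]
names(df1)
```

Here `start_only` is TRUE for `ABC_1_old` while `keep` is not, which is the only difference between the two patterns on these names.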

If dplyr is more to your liking, you might try

library(dplyr)
select(df, matches("^ABC_\\d+"))

Upvotes: 6

hvollmeier

Reputation: 2986

Another straightforward solution is to use substr:

df1 <- df[,substr(names(df),5,7) != 'XYZ']
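To see what the `substr` test is doing, the sketch below prints characters 5 to 7 of each name. The extra column name `id` is hypothetical and is there to show a caveat: the test only excludes names with "XYZ" in that position, so it keeps any unrelated column too. Combining it with `startsWith()` (base R 3.3.0 or later) is one possible way to tighten it, offered as an assumption about what you want rather than part of the original answer.

```r
# Hypothetical column names: the question's four plus an unrelated one
nms <- c("ABC_1", "ABC_2", "ABC_XYZ_1", "ABC_XYZ_2", "id")

# Characters 5 to 7 of each name; shorter names give a shorter or empty string
mid <- substr(nms, 5, 7)

# The answer's test: keep everything that does not have XYZ at positions 5-7
keep_substr <- mid != "XYZ"

# Stricter variant: additionally require the ABC_ prefix
keep_strict <- startsWith(nms, "ABC_") & mid != "XYZ"

nms[keep_substr]
nms[keep_strict]
```

On these names the plain `substr` test also keeps `id`, while the stricter variant keeps only `ABC_1` and `ABC_2`.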

Upvotes: 0
