Reputation:
Is there a way for me to subset data based on column names starting with a particular string? I have some columns which are like ABC_1 ABC_2 ABC_3
and some like XYZ_1, XYZ_2,XYZ_3
let's say.
How can I subset my df
based only on columns containing the above portions of text (lets say, ABC
or XYZ
)? I can use indices, but the columns are too scattered in data and it becomes too much of hard coding.
Also, I want to only include rows from each of these columns where any of their value is >0
so if either of the 6
columns above has a 1
in the row, it makes a cut into my final data frame.
Upvotes: 88
Views: 238253
Reputation: 10011
I'll provide alternative solutions to subset one or multiple columns, based on @Simon O'Hanlon's sample data:
> library(tidyverse)
> library(dplyr)
>
> df <- data.frame(D=runif(3),
+ ABC_1 = runif(3),
+ ABC_2 = runif(3),
+ XYZ_1 = runif(3),
+ XYZ_2 = runif(3)
+ )
>
> df %>%
+ dplyr::select(matches('ABC'))
ABC_1 ABC_2
1 0.06445754 0.16957677
2 0.75470562 0.06221405
3 0.62041003 0.10902927
>
> df %>%
+ dplyr::select(matches('ABC|XYZ'))
ABC_1 ABC_2 XYZ_1 XYZ_2
1 0.06445754 0.16957677 0.3817164 0.1922095
2 0.75470562 0.06221405 0.1693109 0.2571700
3 0.62041003 0.10902927 0.2986525 0.1812318
Upvotes: 2
Reputation: 321
Try this (here, looking for variables whose name contains 'date', including all case combinations):
df %>% dplyr::select(contains("date", ignore.case = TRUE))
Upvotes: 4
Reputation: 18632
Many of the tidyselect
options have been mentioned already. contains
and starts_with
work very well with this specific problem. For more complicated conditions/matching you can use matches
, which will select columns using a regular expression match:
library(dplyr)
df %>%
select(matches("^ABC")) # starts with "ABC"
# case insensitive match
df %>%
select(matches("(?i)^abc")) # starts with "ABC", "Abc", "abc", etc.
Upvotes: 0
Reputation: 181
Simplest solution, given to me by my statistics professor:
df[,grep("pattern", colnames(df))]
That's it. It doesn't give you booleans or anything, it just gives you your dataset that follows that pattern.
Upvotes: 18
Reputation: 3
Building off the above, I think it is the most flexible. Note that you'll need to use dplyr, but that's not a terrible thing.
Advantage: You can search for more than "contains
". Here, I use "starts_with
" for a relatively common string "ST". Using "grep
" here could easily have driven you mad; mad, I say!
library(dplyr)
df %>% dplyr::select(starts_with("ST",ignore.case = TRUE))
Upvotes: 0
Reputation: 438
Just in case for data.table
users, the following works for me:
df[, grep("ABC", names(df)), with = FALSE]
Upvotes: 21
Reputation: 1049
This worked for me:
df[,names(df) %in% colnames(df)[grepl(str,colnames(df))]]
Upvotes: 6
Reputation: 167
Using dplyr you can:
df <- df %>% dplyr:: select(grep("ABC", names(df)), grep("XYZ", names(df)))
Upvotes: 15
Reputation: 810
You can also use starts_with
and dplyr
's select()
like so:
df <- df %>% dplyr:: select(starts_with("ABC"))
Upvotes: 47
Reputation: 59970
Try grepl
on the names of your data.frame
. grepl
matches a regular expression to a target and returns TRUE
if a match is found and FALSE
otherwise. The function is vectorised so you can pass a vector of strings to match and you will get a vector of boolean values returned.
# Data
df <- data.frame( ABC_1 = runif(3),
ABC_2 = runif(3),
XYZ_1 = runif(3),
XYZ_2 = runif(3) )
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290
# Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
# ABC_1 ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444
# grepl returns logical vector like this which is what we use to subset columns
grepl( "ABC" , names( df ) )
#[1] TRUE TRUE FALSE FALSE
To answer the second part, I'd make the subset data.frame and then make a vector that indexes the rows to keep (a logical vector) like this...
set.seed(1)
df <- data.frame( ABC_1 = sample(0:1,3,repl = TRUE),
ABC_2 = sample(0:1,3,repl = TRUE),
XYZ_1 = sample(0:1,3,repl = TRUE),
XYZ_2 = sample(0:1,3,repl = TRUE) )
# We will want to discard the second row because 'all' ABC values are 0:
# ABC_1 ABC_2 XYZ_1 XYZ_2
#1 0 1 1 0
#2 0 0 1 0
#3 1 1 1 0
df1 <- df[ , grepl( "ABC" , names( df ) ) ]
ind <- apply( df1 , 1 , function(x) any( x > 0 ) )
df1[ ind , ]
# ABC_1 ABC_2
#1 0 1
#3 1 1
Upvotes: 122