Reputation: 6730
I have an R data frame with 6 columns, and I want to create a new data frame that only has three of the columns.
Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:
data.frame(df$A,df$B,df$E)
Is there a more compact way of doing this?
Upvotes: 462
Views: 1763905
Reputation: 18612
Sometimes it is easier to remove the columns you do not want than to select the ones you do. In base R this can be done with the - operator for indexes, setdiff or subset for names, or ! for logical vectors:
# Column index
df[-c(3, 4)]
# Column name
subset(df, select = -c(C, D))
df[setdiff(names(df), c("C", "D"))]
# Logical vector
df[!names(df) %in% c("C", "D")]
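As a rough check of the approaches above, the sketch below applies each one to a made-up six-column data frame (the data and the drop_cols name are assumptions for illustration) and confirms that the same columns remain. Note that the - operator works on positions, not on names, which is why the name-based variants go through setdiff or %in% instead.

```r
# Hypothetical six-column data frame, similar to the one in the question
df <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10, F = 11:12)
drop_cols <- c("C", "D", "F")

by_index   <- df[-c(3, 4, 6)]                    # negative positions
by_subset  <- subset(df, select = -c(C, D, F))   # unquoted names inside subset()
by_setdiff <- df[setdiff(names(df), drop_cols)]  # names to keep
by_logical <- df[!names(df) %in% drop_cols]      # logical mask over names

names(by_index)    # "A" "B" "E"
names(by_subset)   # "A" "B" "E"
names(by_setdiff)  # "A" "B" "E"
names(by_logical)  # "A" "B" "E"
```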
Upvotes: 1
Reputation: 18612
TL;DR
If you are using a tibble (commonly used in the tidyverse) you can safely do any of the following to select columns and you will get a tibble back:
library(tibble)
tb <- tibble(A = 1:2, B = 3:4)
# By index
tb[1]
tb[, 1]
tb[1:2]
tb[, 1:2]
# By name
tb["A"]
tb[, "A"]
tb[c("A", "B")]
tb[, c("A", "B")]
This is in addition to the answer given by @Sam Firke, which uses the popular select() verb for column selection.
You can use any of these selection operators on base R data frames, but be aware that in some cases you should specify drop = FALSE.
There is already some discussion about tidyverse versus base R in other answers, but hopefully this adds something.
You can see from the documentation ?`[.data.frame` (and the answer from @Joshua Ulrich) that data frame columns can be selected in several ways. This has to do with the drop argument, which the documentation describes as follows:
If TRUE the result is coerced to the lowest possible dimension. The default is to drop if only one column is left, but not to drop if only one row is left.
If a single vector is given, then columns are indexed and selection behaves like list selection (the drop argument of [ is ignored). In this case, a data frame is always returned:
df <- data.frame(A = 1:2, B = 3:4)
str(df[1])
# 'data.frame': 2 obs. of 1 variable:
# $ A: int 1 2
str(df[1:2])
# 'data.frame': 2 obs. of 2 variables:
# $ A: int 1 2
# $ B: int 3 4
str(df[c("A", "B")])
# 'data.frame': 2 obs. of 2 variables:
# $ A: int 1 2
# $ B: int 3 4
However, if two indices are given ([row, column]) then selection behaves more like matrix selection. In this case the default argument of [ is drop = TRUE, so the result is coerced to the lowest possible dimension, but only when a single column is left:
str(df[1, ]) # single row selection (does not reduce dimension)
# 'data.frame': 1 obs. of 2 variables:
# $ A: int 1
# $ B: int 3
str(df[, 1]) # single column selection (does reduce dimension)
# int [1:2] 1 2
Of course you can always change the default behavior by setting drop = FALSE:
str(df[, 1, drop = FALSE])
# 'data.frame': 2 obs. of 1 variable:
# $ A: int 1 2
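As a minimal sketch of why this matters when the selected columns are not known in advance, the hypothetical helper below (pick_cols is an illustrative name, not a base R function) always passes drop = FALSE, so a single-column request still comes back as a data frame:

```r
df <- data.frame(A = 1:2, B = 3:4)

# Hypothetical helper: select columns by name without dropping dimensions
pick_cols <- function(data, cols) {
  data[, cols, drop = FALSE]
}

is.data.frame(df[, "A"])            # FALSE: the default drop = TRUE returns a vector
is.data.frame(pick_cols(df, "A"))   # TRUE
names(pick_cols(df, c("A", "B")))   # "A" "B"
```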
In the tidyverse, tibbles are preferred. They are like data frames, but have a few significant differences, one being column selection. Column selection using tibbles never reduces dimensionality, as the following shows:
library(tibble)
tb <- as_tibble(df)
class(tb)
# [1] "tbl_df" "tbl" "data.frame"
str(tb[, 1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
# $ A: int [1:2] 1 2
str(tb[1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
# $ A: int [1:2] 1 2
All other forms of tibble column selection work as you would expect (the example above selects only by index, but you can select by name too).
Upvotes: 0
Reputation: 176638
You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.
# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]
Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame. But df["A"] will always return a data frame.
str(df["A"])
## 'data.frame': 1 obs. of 1 variable:
## $ A: int 1
str(df[,"A"]) # vector
## int 1
Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).
# subset (original solution--not recommended)
df[,c("A","B","E")] # returns a data.frame
df[,"A"] # returns a vector
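To illustrate the preference stated at the top of this answer, here is a small sketch of one way subset() can surprise you when column names are treated as object names. The data frame df_demo and the variable cols are made up for the example; the point is that an unquoted name inside select is looked up among the columns first, so a column that happens to share the variable's name wins.

```r
# Hypothetical data: one column is itself called "cols"
df_demo <- data.frame(A = 1:2, B = 3:4, cols = 5:6)

# A variable holding the column names we actually want
cols <- c("A", "B")

names(df_demo[cols])                   # "A" "B"  (character vector is used as-is)
names(subset(df_demo, select = cols))  # "cols"   (resolved to the column named cols)
```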
Upvotes: 532
Reputation: 1236
df <- dplyr::select(df, A, B, E)
Also, you can assign the result to a different name to create a new data frame:
data <- dplyr::select(df, A, B, E)
Upvotes: -1
Reputation: 560
Where df1 is your original data frame:
df2 <- subset(df1, select = c(1, 2, 5))
Upvotes: 23
Reputation: 299
You can also use the sqldf package, which performs selects on R data frames:
df1 <- sqldf("select A, B, E from df")
This gives as output a data frame df1 with columns A, B, and E.
Upvotes: 15
Reputation: 438
For some reason only
df[, (names(df) %in% c("A","B","E"))]
worked for me. All of the above syntaxes yielded "undefined columns selected".
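The answer does not say what the data looked like, so the following is only a guess at one common cause: "undefined columns selected" is raised when at least one requested name is not an actual column (a typo, different case, stray whitespace). The %in% form never raises it because it only tests the existing names. A minimal sketch with made-up data and a deliberately wrong name:

```r
df <- data.frame(A = 1:2, B = 3:4, E = 5:6)
wanted <- c("A", "B", "e")   # hypothetical typo: lower-case "e"

# Direct selection fails; capture the message instead of stopping
msg <- tryCatch(df[wanted], error = function(err) conditionMessage(err))
msg                           # "undefined columns selected"

# See which requested names are missing
setdiff(wanted, names(df))    # "e"

# The %in% form silently keeps only the names that exist
names(df[, names(df) %in% wanted])   # "A" "B"
```

Two things to keep in mind with the %in% form: the result follows the column order of the data frame rather than the order requested, and with the comma a single match is dropped to a vector unless drop = FALSE is added.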
Upvotes: 22
Reputation: 637
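[ and subset are not substitutable: [ returns a vector if only one column is selected.
df <- data.frame(a = "a", b = "b")
# FALSE: df[, "a"] is a vector, while subset() returns a one-column data frame
identical(
  df[, c("a")],
  subset(df, select = "a")
)
# With two columns, both calls return a data frame
identical(
  df[, c("a", "b")],
  subset(df, select = c("a", "b"))
)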
Upvotes: 0
Reputation: 23004
Using the dplyr package, if your data.frame is called df1:
library(dplyr)
df1 %>%
select(A, B, E)
This can also be written without the %>% pipe as:
select(df1, A, B, E)
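If the column names are stored in a character vector rather than typed out, select() can take them through the all_of() helper. This is a small sketch with made-up data; it assumes a dplyr version that exports all_of() (1.0.0 or later), and the variable name cols is just for illustration.

```r
library(dplyr)

# Hypothetical six-column data frame
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10, F = 11:12)

# Column names held in a variable
cols <- c("A", "B", "E")

res <- select(df1, all_of(cols))
names(res)   # "A" "B" "E"
```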
Upvotes: 261
Reputation: 84519
This is the role of the subset() function:
> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> subset(dat, select=c("A", "B"))
A B
1 1 3
2 2 4
Upvotes: 117
Reputation: 6784
There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or df[,c(1,2,5)], as in:
> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> df
A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
A B E
1 1 3 8
2 2 4 8
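The two choices give the same result here, but they are not interchangeable if the column order can change. A short sketch using the same data (the reordered copy is an assumption added for illustration) shows how to look up the positions of the names and what happens to each form after the columns are rearranged:

```r
df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))

# Positions of the wanted names in the current column order
match(c("A", "B", "E"), names(df))        # 1 2 5

# Hypothetical reordering: same columns, reversed order (F E D C B A)
reordered <- df[rev(names(df))]

names(reordered[, c(1, 2, 5)])            # "F" "E" "B"  (positions now point elsewhere)
names(reordered[, c("A", "B", "E")])      # "A" "B" "E"  (names still find the same columns)
```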
Upvotes: 92