Aren Cambre

Reputation: 6730

Extracting specific columns from a data frame

I have an R data frame with 6 columns, and I want to create a new data frame that only has three of the columns.

Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:

data.frame(df$A,df$B,df$E)

Is there a more compact way of doing this?

Upvotes: 462

Views: 1763905

Answers (12)

LMc

Reputation: 18612

Sometimes it is easier to remove the columns you do not want than to select the ones you do. In base R this can be done with the - operator for indexes, with setdiff or subset for names, or with ! for logical vectors:

# Column index
df[-c(3, 4)]

# Column name
subset(df, select = -c(C, D))
df[setdiff(names(df), c("C", "D"))]

# Logical vector
df[!names(df) %in% c("C", "D")]
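
As a quick check, the sketch below applies each of the four forms to a small made-up data frame (the column values are arbitrary and only for illustration) and compares the column names that survive. To end up with the A, B, and E columns from the question, it drops C, D, and F.

```r
# Hypothetical sample data with six columns, as in the question
df <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10, F = 11:12)
drop_cols <- c("C", "D", "F")

by_index   <- df[-c(3, 4, 6)]                      # negative column indexes
by_subset  <- subset(df, select = -c(C, D, F))     # unquoted names in subset()
by_setdiff <- df[setdiff(names(df), drop_cols)]    # names that are not dropped
by_logical <- df[!names(df) %in% drop_cols]        # negated logical vector

# Collect the column names kept by each approach
kept <- lapply(list(by_index, by_subset, by_setdiff, by_logical), names)
print(kept)
```

Each element of kept should list the same three column names.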

Upvotes: 1

LMc

Reputation: 18612

TL;DR

If you are using a tibble (commonly used in the tidyverse) you can safely do any of the following to select columns and you will get a tibble back:

library(tibble)
tb <- tibble(A = 1:2, B = 3:4)

# By index
tb[1]
tb[, 1]

tb[1:2]
tb[, 1:2]


# By name
tb["A"]
tb[, "A"]

tb[c("A", "B")]
tb[, c("A", "B")]

This is in addition to the answer given by @Sam Firke which uses the popular select() verb for column selection.

You can use any of these selection operators on base R data frames, but be aware that the [row, column] form needs drop = FALSE if you want a data frame back when only one column is selected.


There is already some discussion about tidyverse versus base R in other answers, but hopefully this adds something.

You can see from the documentation ?`[.data.frame` (and the answer from @Joshua Ulrich) that data frame columns can be selected in several ways. The differences between them come down to the drop argument:

If TRUE the result is coerced to the lowest possible dimension. The default is to drop if only one column is left, but not to drop if only one row is left.

If a single vector is given, then columns are indexed and selection behaves like list selection (the drop argument of [ is ignored). In this case, a data frame is always returned:

df <- data.frame(A = 1:2, B = 3:4)

str(df[1])
# 'data.frame': 2 obs. of  1 variable:
#  $ A: int  1 2

str(df[1:2])
# 'data.frame': 2 obs. of  2 variables:
#  $ A: int  1 2
#  $ B: int  3 4

str(df[c("A", "B")])
# 'data.frame': 2 obs. of  2 variables:
#  $ A: int  1 2
#  $ B: int  3 4

However, if two indices are given ([row, column]), selection behaves more like matrix selection. Here the default is drop = TRUE, so the result is coerced to the lowest possible dimension, but only when a single column is left:

str(df[1, ]) # single row selection (does not reduce dimension)
# 'data.frame': 1 obs. of  2 variables:
#  $ A: int 1
#  $ B: int 3

str(df[, 1]) # single column selection (does reduce dimension)
# int [1:2] 1 2

Of course you can always change the default behavior by setting drop = FALSE:

str(df[, 1, drop = FALSE])
# 'data.frame': 2 obs. of  1 variable:
#  $ A: int  1 2
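
One place where this matters is code that selects a varying number of columns. The following is a minimal sketch, assuming a hypothetical helper named pick_cols (it is not part of base R), that shows how drop = FALSE keeps the return type stable when the selection shrinks to one column:

```r
df <- data.frame(A = 1:2, B = 3:4)

# Hypothetical helper: matrix-style selection that always returns a data frame
pick_cols <- function(data, cols) {
  data[, cols, drop = FALSE]
}

one <- pick_cols(df, "A")
two <- pick_cols(df, c("A", "B"))

is.data.frame(one)        # TRUE
is.data.frame(two)        # TRUE
is.data.frame(df[, "A"])  # FALSE: the default drop = TRUE returns a vector
```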

In the tidyverse, tibbles are preferred. They are like data frames but have a few significant differences, one being column selection. Column selection on a tibble never reduces dimensionality, as shown below:

library(tibble)

tb <- as_tibble(df)
class(tb)
# [1] "tbl_df"     "tbl"        "data.frame"

str(tb[, 1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
#  $ A: int [1:2] 1 2

str(tb[1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
#  $ A: int [1:2] 1 2

All the other tibble column selection works as you would expect (above only shows by index, but you can select by name too).

Upvotes: 0

Joshua Ulrich

Reputation: 176638

You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.

# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]

Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame, whereas df["A"] always returns a data frame.

str(df["A"])
## 'data.frame':    1 obs. of  1 variable:
## $ A: int 1
str(df[,"A"])  # vector
##  int 1

Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).

# subset (original solution--not recommended)
df[,c("A","B","E")]  # returns a data.frame
df[,"A"]             # returns a vector
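
To illustrate why the comma-free form is convenient inside functions, here is a minimal sketch. The helper col_means is hypothetical and exists only for this example; it relies on data[cols] staying a data frame even when cols has length one.

```r
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

# Hypothetical helper: mean of each requested column.
# data[cols] is always a data frame, so vapply() iterates over columns.
col_means <- function(data, cols) {
  vapply(data[cols], mean, numeric(1))
}

three <- col_means(df, c("A", "B", "E"))
single <- col_means(df, "A")

names(three)   # one entry per requested column
names(single)  # still named by column, even for a single column
```

With data[, cols] instead, a single column name would hand vapply() a bare vector, and the function would iterate over its elements rather than over columns.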

Upvotes: 532

Mohamed Rahouma

Reputation: 1236

df <- dplyr::select(df, A, B, E)

You can also assign the result to a new name:

data <- dplyr::select(df, A, B, E)

Upvotes: -1

moodymudskipper

Reputation: 47300

You can use with():

with(df, data.frame(A, B, E))

Upvotes: 5

Richard Ball

Reputation: 560

Where df1 is your original data frame:

df2 <- subset(df1, select = c(1, 2, 5))

Upvotes: 23

Aman Burman

Reputation: 299

You can also use the sqldf package, which runs SQL SELECT statements on R data frames:

df1 <- sqldf("select A, B, E from df")

The output is a data frame df1 with columns A, B, and E.

Upvotes: 15

so860

Reputation: 438

For some reason only

df[, (names(df) %in% c("A","B","E"))]

worked for me. All of the above syntaxes yielded "undefined columns selected".
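
One possible explanation, offered as an assumption because the answer does not show its data, is that one of the requested names did not exactly match a column name. Name-based selection stops with an error in that case, while the %in% form silently keeps whatever matches. The sketch below reproduces this with made-up data in which "Z" is deliberately missing:

```r
df <- data.frame(A = 1:2, B = 3:4, E = 5:6)
wanted <- c("A", "B", "Z")   # "Z" is not a column of df

# Name-based selection fails as soon as one name is missing
msg <- tryCatch(
  {
    df[wanted]
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# The %in% form keeps only the names that actually exist
matched <- names(df[, names(df) %in% wanted])
print(matched)

# Listing the names that are absent makes the cause explicit
missing_cols <- setdiff(wanted, names(df))
print(missing_cols)
```

Checking setdiff(wanted, names(df)) before selecting is a cheap way to catch typos or stray whitespace in column names.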

Upvotes: 22

fxi

Reputation: 637

[ and subset() are not interchangeable:

[ with a comma returns a vector if only one column is selected, whereas subset() returns a data frame.

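df <- data.frame(a = "a", b = "b")

# One column: `[` with a comma returns a vector and subset() a data frame,
# so this comparison is FALSE
identical(
  df[, "a"],
  subset(df, select = "a")
)

# Two columns: both sides are data frames
identical(
  df[, c("a", "b")],
  subset(df, select = c("a", "b"))
)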

Upvotes: 0

Sam Firke

Reputation: 23004

Using the dplyr package, if your data.frame is called df1:

library(dplyr)

df1 %>%
  select(A, B, E)

This can also be written without the %>% pipe as:

select(df1, A, B, E)
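
If the column names are held in a character vector rather than typed inline, one option is the all_of() helper that dplyr re-exports from tidyselect. This is a sketch with made-up data, and the variable name wanted is only illustrative:

```r
library(dplyr)

# Hypothetical sample data
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10)
wanted <- c("A", "B", "E")

# all_of() selects exactly the columns named in the vector
picked <- select(df1, all_of(wanted))
names(picked)
```

all_of() raises an error if any name in the vector is not a column, which is usually what you want in scripts and functions.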

Upvotes: 261

Stéphane Laurent

Reputation: 84519

This is the role of the subset() function:

> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> subset(dat, select=c("A", "B"))
  A B
1 1 3
2 2 4

Upvotes: 117

Henry

Reputation: 6784

There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or

df[,c(1,2,5)]

as in

> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> df
  A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
  A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
  A B E
1 1 3 8
2 2 4 8

Upvotes: 92
