Vyvanse

Reputation: 1

How can I extract specific columns from a list of dataframes with lapply?

I have a list containing nine dataframes (called data), each of varying lengths and contents. Consistent across most of them, though, are columns containing information that I want to store in a separate dataframe for later use. These columns are the following:

identifiers <- c("Organism Name", "Protein names", "Gene names", "Pathway", "Biological Process")

I want to iterate through each element of data to check whether it contains the columns I'm interested in, then subset these columns as separate dataframes.

I first tried

lapply(data, '[', identifiers)

The problem with this is that not all of the dfs contain all of the identifiers listed above, so running this returns 'undefined columns selected'.

My next attempt was

lapply(data, function(x) if(identifiers %in% x) '[', identifiers)

which returned a list of 9 (corresponding to the 9 original dataframes) of class NULL. I think that this general method would work with proper execution, but I just can't figure it out.

Any help would be appreciated :)

Upvotes: 0

Views: 665

Answers (1)

r2evans

Reputation: 160792

Since identifiers is a vector of column names, some or all of which may be in each frame, we can do:

lapply(data, function(x) x[,intersect(names(x), identifiers),drop=FALSE])

with the understanding that some elements may have zero columns (if none are found).
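As a minimal sketch of what that returns, here is the same call on a made-up list of three frames (the frame names, the Length column, and the values are hypothetical, not from your data). Note check.names = FALSE, which is assumed here so that the spaces in the column names survive:

```r
identifiers <- c("Organism Name", "Protein names", "Gene names",
                 "Pathway", "Biological Process")

# Hypothetical toy frames: one with every identifier, one with a single
# identifier, and one with none of them.
data <- list(
  full = data.frame("Organism Name" = "E. coli", "Protein names" = "P1",
                    "Gene names" = "g1", "Pathway" = "pw",
                    "Biological Process" = "bp", "Length" = 10,
                    check.names = FALSE),
  partial = data.frame("Organism Name" = "Yeast", "Length" = 20,
                       check.names = FALSE),
  none = data.frame("Length" = 30, check.names = FALSE)
)

# Keep whichever identifier columns each frame actually has.
subsets <- lapply(data, function(x) x[, intersect(names(x), identifiers), drop = FALSE])

# Number of columns kept per frame: full = 5, partial = 1, none = 0
sapply(subsets, ncol)
```

The drop=FALSE matters for the frame with a single match: without it, x[, "Organism Name"] would come back as a plain vector instead of a one-column frame.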

Your use of if (identifiers %in% x) is not quite right for two reasons:

  1. identifiers %in% x looks for the identifiers in the data itself, not in the column names; it should be identifiers %in% names(x); and

  2. if requires exactly one logical, but identifiers %in% names(x) is going to return a logical vector the same length as identifiers (i.e., not one). It needs to be summarized.
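To make the second point concrete, here is a small sketch on a hypothetical one-row frame that has two of the five columns. any() and all() are the usual ways to collapse the vector into the single TRUE/FALSE that if needs:

```r
identifiers <- c("Organism Name", "Protein names", "Gene names",
                 "Pathway", "Biological Process")

# Hypothetical frame containing only two of the identifier columns.
x <- data.frame("Organism Name" = "Yeast", "Pathway" = "pw",
                check.names = FALSE)

# One logical per identifier, in the order of identifiers.
found <- identifiers %in% names(x)
found       # TRUE FALSE FALSE TRUE FALSE

# Collapse to a single logical that is safe to use in if ().
any(found)  # TRUE: at least one identifier column is present
all(found)  # FALSE: not every identifier column is present
```

As far as I know, passing a length-5 condition straight to if is an error from R 4.2.0 onward, and in older versions it only used the first element with a warning, so summarizing is needed either way.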

If a frame that contains any of these columns is guaranteed to contain all of them, then you can change my code above to be

lapply(data, function(x) if (all(identifiers %in% names(x))) x[,identifiers])

and frames without those columns will return NULL. My use above of intersect also works in this regard, the functional difference being in the case where a frame contains some but not all of them. Which logic you prefer is up to you.
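A rough side-by-side of the two approaches, again on made-up frames (names and values are hypothetical). The Filter step at the end is an optional extra, assuming you would rather drop the NULL entries than keep them as placeholders:

```r
identifiers <- c("Organism Name", "Protein names", "Gene names",
                 "Pathway", "Biological Process")

# Hypothetical toy frames: one complete, one with a single identifier.
data <- list(
  full = data.frame("Organism Name" = "E. coli", "Protein names" = "P1",
                    "Gene names" = "g1", "Pathway" = "pw",
                    "Biological Process" = "bp", check.names = FALSE),
  partial = data.frame("Organism Name" = "Yeast", "Length" = 20,
                       check.names = FALSE)
)

# Approach 1: keep whatever subset of identifiers is present.
by_intersect <- lapply(data, function(x) x[, intersect(names(x), identifiers), drop = FALSE])

# Approach 2: all identifiers or nothing (if without else yields NULL).
all_or_null <- lapply(data, function(x) if (all(identifiers %in% names(x))) x[, identifiers])

names(by_intersect$partial)   # "Organism Name"
is.null(all_or_null$partial)  # TRUE

# Optional: drop the NULL entries from the all-or-nothing result.
kept <- Filter(Negate(is.null), all_or_null)
names(kept)                   # "full"
```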

Upvotes: 2
