Forge
Forge

Reputation: 1677

select data frame columns even if they are non existent

Loaded a bunch of Excel files as data frames, standarized column names using a dictionnary list and since it's supposed half of columns are destined to be in other data frame, I've tried to split & stack them all using rbind.fill from dplyr package.

The issue is that some of them have 90 columns, some 56, some other just 23, so when selecting columns (I got a list of column names for split1, split2) luckily, you got them all, but some other times you don't.

# this is supposed to be in a loop splitting & stacking all df

    split1 <- df[ ,c("column1","column2","column3")]

    Error in `[.data.frame`(df, , c("column1","column2","column3")) : 
      undefined columns selected

is there any way of selecting columns from a data frame no matter if the mentioned columns are unavailable?

I don't care if columns are filled with NA or any other value (0, NULL...). Just need to make selection properly to stack all split1, split2....

Found some SO answers for not the exact same issue and using data.table package. A base R solution is preferred.

Upvotes: 0

Views: 161

Answers (2)

useRj
useRj

Reputation: 1312

sometimes it's not needed a particular trick or code hint. Just reframing the problem gives you a fairly neat solution.

I'll assume the whole bunch of Excel files got all the needed columns (please let me know if this is not the case).

So, given that all files got all columns....why don't stack all files using dplyr::rbind.fill and after all files are in one data frame, then split.

Pseudocode:

enormous <- data.frame()

for (i in 1:length(dfs)) {

  enormous <- rbind.fill(enormous, i)


} 

# then split

split1 <- subset columns from enormous
split2 <- subset columns from enormous

It's less code and less computer intensive. Take into account that subsetting columns and then splitting must be done per file, but stacking all is easier (yes, you can use the data.tabletrick here) and must be just one splitting step.

is this a valid approach for your question?

Upvotes: 1

akrun
akrun

Reputation: 886938

We can use setdiff to find the elements that are not in the column names and assign it to NA

v1 <- c("column1","column2","column3")
nm1 <- setdiff(v1, names(df))
df[nm1] <- NA
df[v1]

For ease of use, this can be wrapped as a function

fCol <- function(vec, dat){
     nm1 <- setdiff(vec, names(dat))
     dat[nm1] <- NA
     dat[vec]
}
fCol(v1, df)
#    column1 column2 column3
#1       1      NA      NA
#2       2      NA      NA
#3       3      NA      NA
#4       4      NA      NA
#5       5      NA      NA

data

df <- data.frame(column1 = 1:5, col2 = 6:10)

Upvotes: 1

Related Questions