Reputation: 101
I have 2 data frames in R with epigenetic data. To use one of them as a training set and the other as a test set in the glmnet package, the number of columns in both has to match. As both data frames contain more than 800000 columns, I'm looking for a way to compare the column names of the 2 data frames so that I can delete the columns that the two don't have in common. So far I have only found packages and functions that compare the rows of two data frames with each other. As an example, I'm looking for something like this:
df1
participant_code cg123 cg122 cg121 cg120
df2
participant_code cg123 cg122 cg121 cg119
The function would then give me, e.g., a table showing which colnames differ:
colname 5 differs
Upvotes: 5
Views: 8864
Reputation: 11
You could try using the inspectdf package. There is also comparedf in the arsenal package.
Upvotes: 0
Reputation: 717
You are looking for the intersection of the column names of the two data frames. You can simply use the function intersect to achieve what you want. First you extract the names of both data frames. Then you use intersect. The result of intersect contains the column names that are present in both data frames. Use this object to subset the initial data frames and you're done.
# define data frames with dummy data
df1 <- data.frame(participant_code = 1,
cg123 = 2,
cg122 = 3,
cg121 = 4,
cg120 = 5)
df2 <- data.frame(participant_code = 6,
cg123 = 7,
cg122 = 8,
cg121 = 9,
cg119 = 10)
# extract column names of the data frames
cols_df_1 <- names(df1)
cols_df_2 <- names(df2)
# find the intersection of both column name vectors
cols_intersection <- intersect(cols_df_1, cols_df_2)
# subset the initial data frames
df1_sub <- df1[,cols_intersection]
df2_sub <- df2[,cols_intersection]
# print to console and see result
df1_sub
#   participant_code cg123 cg122 cg121
# 1                1     2     3     4
df2_sub
#   participant_code cg123 cg122 cg121
# 1                6     7     8     9
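To answer the "which colnames differ" part of the question directly, setdiff can complement intersect. This is a minimal sketch on the same dummy data as above; the names only_in_df1, only_in_df2 and col_report are illustrative and not from the original post.

```r
# dummy data, same as in the example above
df1 <- data.frame(participant_code = 1, cg123 = 2, cg122 = 3, cg121 = 4, cg120 = 5)
df2 <- data.frame(participant_code = 6, cg123 = 7, cg122 = 8, cg121 = 9, cg119 = 10)

# column names present in one data frame but not in the other
only_in_df1 <- setdiff(names(df1), names(df2))
only_in_df2 <- setdiff(names(df2), names(df1))

# small table listing every non-shared column and where it was found
col_report <- data.frame(
  colname  = c(only_in_df1, only_in_df2),
  found_in = rep(c("df1", "df2"),
                 times = c(length(only_in_df1), length(only_in_df2)))
)
col_report
```

With this dummy data the table should list cg120 (only in df1) and cg119 (only in df2). Since only the name vectors are compared, this should stay cheap even with 800000 columns.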
Upvotes: 5
Reputation:
This might not work best for a huge data frame, but I have recently become a fan of compare() from the new waldo package.
This will show the differences between the two. Again, it might be indecipherable for 800k-length vectors, but I thought it was worth pointing out.
library(waldo)
compare(names(df1), names(df2))
Upvotes: 2
Reputation: 389325
You can use intersect to get the common columns of both data frames.
get_common_cols <- function(df1, df2) intersect(names(df1), names(df2))
You can pass both data frames to the function to get the common columns and use the result to subset the data frames:
common_cols <- get_common_cols(data1, data2)
data1 <- data1[, common_cols]
data2 <- data2[, common_cols]
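A self-contained sketch of the same approach on dummy data, with a check that both data frames end up with the same columns in the same order (which matters if they are later converted to matrices for glmnet). Here data1 and data2 are made-up stand-ins, and drop = FALSE is an added safeguard that is not in the snippet above; it keeps the result a data frame even if only one column is shared.

```r
get_common_cols <- function(df1, df2) intersect(names(df1), names(df2))

# made-up stand-ins for the real data; note the different column order
data1 <- data.frame(participant_code = 1, cg123 = 2, cg122 = 3, cg120 = 5)
data2 <- data.frame(cg122 = 8, participant_code = 6, cg119 = 10, cg123 = 7)

common_cols <- get_common_cols(data1, data2)

# drop = FALSE keeps a data frame even when a single column is selected
data1 <- data1[, common_cols, drop = FALSE]
data2 <- data2[, common_cols, drop = FALSE]

# both data frames now have the same columns in the same order
stopifnot(identical(names(data1), names(data2)))
names(data2)
```

Because both data frames are indexed with the same common_cols vector, the column order follows the first data frame passed to get_common_cols.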
Upvotes: 3