Reputation: 33
I have a very large data set. I need to find out which pairs of variables have the highest correlations in the data set.
My code below prints the full correlation matrix, but with 69 columns it is impractical to check every entry by eye.
The code I am using is below:
# File: CW_ModelDevelopment; select only numeric columns
CW_ModelDevelopment %>%
# Select numeric columns
select_if(is.numeric) %>%
# Calculate correlation matrix
cor()
Please can someone help me with code that shows the results as percentages, or that sets a condition so that any pair correlating above some threshold x is printed? An example in Python of what I want is below:
Upvotes: 0
Views: 600
Reputation: 389047
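You can try:
# Flag cells whose correlation exceeds 0.9 (use abs(cor(mtcars)) to also catch strong negative correlations)
mat <- cor(mtcars) > 0.9
# Ignore each variable's correlation with itself
diag(mat) <- FALSE
# Column names of the flagged cells
colnames(mtcars)[which(mat, arr.ind = TRUE)[, 2]]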
#[1] "cyl" "disp"
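The snippet above returns only the variable names. A minimal base R sketch that also reports which variables are paired and how strongly, under a few assumptions: `high_cor_pairs` is a hypothetical helper name (not part of any package), "percentages" is read as the correlation coefficient multiplied by 100, and the threshold is applied to the absolute correlation so strong negative relationships are included too.

```r
# Hypothetical helper (not from any package): list each variable pair whose
# absolute correlation exceeds `threshold`, with the value also shown as a percentage.
high_cor_pairs <- function(df, threshold = 0.9) {
  # Keep numeric columns only
  num <- df[vapply(df, is.numeric, logical(1))]
  cm <- cor(num, use = "pairwise.complete.obs")
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(upper.tri(cm) & abs(cm) > threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    cor  = cm[idx],
    stringsAsFactors = FALSE
  )
  # Assumption: "percentage" means the coefficient scaled by 100
  out$pct <- round(100 * out$cor, 1)
  # Strongest relationships first
  out[order(-abs(out$cor)), , drop = FALSE]
}

res <- high_cor_pairs(mtcars, threshold = 0.9)
print(res)
```

For the data in the question, the call would presumably be `high_cor_pairs(CW_ModelDevelopment, threshold = x)` with whatever cutoff is appropriate.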
Upvotes: -1
Reputation: 6628
If we treat the cor() output as an adjacency matrix for a network object, igraph can help us transform it into a data.frame structure in which each variable combination has its own row. Using dplyr::top_n() we can then see the top ten results and their values. Note that each pair appears twice in the output, once per direction.
library(tidyverse)
library(igraph, warn.conflicts = FALSE)
matrix(sample(1:10, 1000, replace = TRUE), 20 , 50) %>%
as_tibble(.name_repair = "universal") %>%
cor() %>%
igraph::graph_from_adjacency_matrix(weighted = TRUE,
diag = FALSE) %>%
igraph::as_data_frame() %>%
top_n(10)
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...
#> Selecting by weight
#> from to weight
#> 1 ...3 ...34 0.7098358
#> 2 ...5 ...24 0.6054965
#> 3 ...9 ...16 0.6129791
#> 4 ...16 ...9 0.6129791
#> 5 ...21 ...38 0.6092931
#> 6 ...24 ...5 0.6054965
#> 7 ...33 ...42 0.6226324
#> 8 ...34 ...3 0.7098358
#> 9 ...38 ...21 0.6092931
#> 10 ...42 ...33 0.6226324
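Because the correlation matrix is symmetric, the top ten rows above are really five pairs listed in both directions. A possible base R alternative that keeps each pair once and ranks by absolute correlation is sketched below. It is not the approach used above: it swaps igraph for `as.table()`, and the seed and the `V1`..`V50` column names are arbitrary choices made only for this illustration.

```r
set.seed(42)  # arbitrary seed, only so the illustration is repeatable
m <- matrix(sample(1:10, 1000, replace = TRUE), nrow = 20, ncol = 50,
            dimnames = list(NULL, paste0("V", 1:50)))  # made-up column names

cm <- cor(m)
# Blank the lower triangle and diagonal so each pair is kept exactly once
cm[lower.tri(cm, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair
pairs <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pairs) <- c("from", "to", "weight")
pairs <- pairs[!is.na(pairs$weight), ]

# Ten strongest pairs, positive or negative
top10 <- head(pairs[order(-abs(pairs$weight)), ], 10)
print(top10)
```

Ranking by `abs(weight)` is a design choice: it treats a correlation of -0.7 as being as interesting as +0.7, whereas ranking by the raw weight surfaces only positive correlations.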
Upvotes: 2