Reputation: 816
In R, we can use the cor
function to get the correlation between two columns, but it doesn't work for non-numeric values.
I ask because I need to preprocess some data, and I suspect two columns are very similar: by inspection, I found that whenever the first column says "A", the second column always says "B". I want to be sure that, indeed, if I know the value in the first column, I can always deduce the value in the second.
If I'm not clear, here's an example to illustrate.
dataframe <- read.csv(file = 'data/company_product.csv')
Where data/company_product.csv is a table like so
Company Name    Main Product    rest of the data ...
By Apple        A phone         some_other_data  ...
By Apple        A phone         some_other_data  ...
By Microsoft    A computer      some_other_data  ...
By Nokia        A tablet        some_other_data  ...
By Nokia        A tablet        some_other_data  ...
By Nokia        A tablet        some_other_data  ...
...             ...             ...
As you can see in this file, the column Main Product is redundant: if the column Company Name is "By Apple", Main Product will always be "A phone".
This means the column Company Name is highly correlated with the column Main Product, but I can't find a simple way in R to show that.
I'm not sure whether the answer is extremely trivial or whether this is a key problem in text mining, but I don't need a precise correlation. All I want is a yes/no answer to: "every time a value appears in the first column, is it always paired with the same value in the second column?"
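To make the goal concrete, the kind of one-line check I'm hoping exists might look like this (the column names first and second are just placeholders for my two columns):

```r
# Sample data standing in for the two columns of interest
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
# For each distinct value of `first`, count the distinct values of `second`;
# the mapping is deterministic iff every count is 1
all(tapply(df$second, df$first, function(x) length(unique(x))) == 1)
## [1] TRUE
```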
Thanks
Upvotes: 1
Views: 271
Reputation: 269905
Use table to assess this:
table(df[, 1:2])
giving the following, which has only one non-zero entry in each row and each column, showing that By Apple is associated with A phone, By Microsoft with A computer, and By Nokia with A tablet.
second
first A computer A phone A tablet
By Apple 0 2 0
By Microsoft 1 0 0
By Nokia 0 0 2
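If you want the table reduced to a single yes/no, one possible follow-up (a sketch, assuming df holds the two columns shown) is:

```r
# Reconstruct the sample data from the question
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
tab <- table(df[, 1:2])
# TRUE when every row of the contingency table has exactly one non-zero cell,
# i.e. each first-column value pairs with exactly one second-column value
all(rowSums(tab > 0) == 1)
## [1] TRUE
```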
or simply count the number of times each unique row appears:
aggregate(list(count = df[[1]]), df, length)
## first second count
## 1 By Microsoft A computer 1
## 2 By Apple A phone 2
## 3 By Nokia A tablet 2
or
library(dplyr)
count(df, first, second)
## first second n
## 1 By Apple A phone 2
## 2 By Microsoft A computer 1
## 3 By Nokia A tablet 2
or, if you don't care about the count, just look at the unique rows:
unique(df[, 1:2])
## first second
## 1 By Apple A phone
## 2 By Microsoft A computer
## 4 By Nokia A tablet
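The unique-rows view also gives a direct yes/no: if no first-column value repeats among the unique pairs, the second column is fully determined (a sketch, assuming the same df):

```r
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
u <- unique(df[, 1:2])
# deterministic mapping iff no first value appears twice among the unique pairs
!any(duplicated(u$first))
## [1] TRUE
```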
Visualize this as follows:
library(igraph)
g <- graph_from_incidence_matrix(table(df[, 1:2]))
plot(g, layout = layout.bipartite)
Upvotes: 1
Reputation: 102349
Maybe you can try table
, xtabs
, or dcast
from the data.table
package:
> table(df)
second
first A computer A phone A tablet
By Apple 0 2 0
By Microsoft 1 0 0
By Nokia 0 0 2
> xtabs(~ first + second, df)
second
first A computer A phone A tablet
By Apple 0 2 0
By Microsoft 1 0 0
By Nokia 0 0 2
> dcast(data.table::setDT(df), first ~ second)
Using 'second' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'
first A computer A phone A tablet
1: By Apple 0 2 0
2: By Microsoft 1 0 0
3: By Nokia 0 0 2
Data
df <- data.frame(
first = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
Upvotes: 1