Reputation: 816
In R, we can use the cor
function to get the correlation between two columns, but it doesn't work for non-numeric values.
I ask because I need to preprocess some data, and I suspect two columns are very similar: by inspection, I found that whenever the first column says "A", the second column always says "B". I want to be sure that, indeed, if I know the value in the first column, I can always deduce the value in the second.
If I'm not clear, here's an example to illustrate.
dataframe <- read.csv(file = 'data/company_product.csv')
Where data/company_product.csv is a table like so
Company Name    Main Product    rest of the data ...
By Apple        A phone         some_other_data  ...
By Apple        A phone         some_other_data  ...
By Microsoft    A computer      some_other_data  ...
By Nokia        A tablet        some_other_data  ...
By Nokia        A tablet        some_other_data  ...
By Nokia        A tablet        some_other_data  ...
...             ...             ...
As you can see in this file, the column Main Product is redundant: if the column Company Name is "By Apple", Main Product will always be "A phone".
This means the column Company Name is highly correlated with the column Main Product, but I can't find a simple way in R to show that.
I'm not sure whether the answer is extremely trivial or whether this is a key problem in text mining, but I don't need a precise correlation. All I want is a yes/no answer to: "every time a value appears in the first column, is it always paired with the same value in the second column?"
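To make the goal concrete, the kind of one-line check I'm hoping exists might look like this (the column names first and second are just placeholders for my two columns):

```r
# Sample data standing in for the two columns of interest
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
# For each distinct value of `first`, count the distinct values of `second`;
# the mapping is deterministic iff every count is 1
all(tapply(df$second, df$first, function(x) length(unique(x))) == 1)
## [1] TRUE
```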
Thanks
Upvotes: 1
Views: 271
Reputation: 269905
Use table to assess this:
table(df[, 1:2])
giving the following, which has only one non-zero entry in each row and each column, showing that By Apple is associated with A phone, By Microsoft with A computer, and By Nokia with A tablet.
second
first A computer A phone A tablet
By Apple 0 2 0
By Microsoft 1 0 0
By Nokia 0 0 2
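If you want the table reduced to a single yes/no, one possible follow-up (a sketch, assuming df holds the two columns shown) is:

```r
# Reconstruct the sample data from the question
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
tab <- table(df[, 1:2])
# TRUE when every row of the contingency table has exactly one non-zero cell,
# i.e. each first-column value pairs with exactly one second-column value
all(rowSums(tab > 0) == 1)
## [1] TRUE
```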
or simply count the number of times each unique row appears:
aggregate(list(count = df[[1]]), df, length)
## first second count
## 1 By Microsoft A computer 1
## 2 By Apple A phone 2
## 3 By Nokia A tablet 2
or
library(dplyr)
count(df, first, second)
## first second n
## 1 By Apple A phone 2
## 2 By Microsoft A computer 1
## 3 By Nokia A tablet 2
or, if you don't care about the count, just look at the unique rows:
unique(df[, 1:2])
## first second
## 1 By Apple A phone
## 2 By Microsoft A computer
## 4 By Nokia A tablet
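The unique-rows view also gives a direct yes/no: if no first-column value repeats among the unique pairs, the second column is fully determined (a sketch, assuming the same df):

```r
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
u <- unique(df[, 1:2])
# deterministic mapping iff no first value appears twice among the unique pairs
!any(duplicated(u$first))
## [1] TRUE
```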
Visualize this as follows:
library(igraph)
g <- graph_from_incidence_matrix(table(df[, 1:2]))
plot(g, layout = layout.bipartite)
Upvotes: 1
Reputation: 102349
Maybe you can try table
, xtabs
, or dcast
from the data.table
package:
> table(df)
second
first A computer A phone A tablet
By Apple 0 2 0
By Microsoft 1 0 0
By Nokia 0 0 2
> xtabs(~ first + second, df)
second
first A computer A phone A tablet
By Apple 0 2 0
By Microsoft 1 0 0
By Nokia 0 0 2
> dcast(data.table::setDT(df), first ~ second)
Using 'second' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'
first A computer A phone A tablet
1: By Apple 0 2 0
2: By Microsoft 1 0 0
3: By Nokia 0 0 2
Data
df <- data.frame(
first = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
Upvotes: 1