M.A
M.A

Reputation: 65

Count occurrences of a variable having two given values corresponding to one value of another variable

As it can be seen in the picture, I have a column with order numbers and a column with material numbers.

I want to find how often a pair of materials occur in the same order.

The problem is that I have 30000 rows of order numbers and 700 unique material numbers. Is it even possible?

I was thinking if it was easier to make a matrix with the 700 material numbers both in rows and column, and count number occurrences.

PICTURE

EDIT: The first picture was not a good example. I uploaded this second picture with random material numbers. So I want it to count for each pair (example 10-11, as I highlighted), how many times the appear in the same order. As it can be seen, 10&11 appear in 3 different orders.

PICTURE2

Upvotes: 1

Views: 329

Answers (3)

purpleblade98
purpleblade98

Reputation: 87

Building on the awesome code from @Syncrossus, you can with a couple small edits, get the results repeated across the table (i.e. since each combination has two cells where it meets, fill in that count for both cells, looks cleaner imo), as well as add a column to count all the ones that have no match.

To do the extra column, be sure to set

Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials + 1))
colnames(Pairs) <- c(all_materials, "By itself")

From there, if you keep the rest of the code prior to the last chunk the same, you can edit that final for loop as follows:

for(order in levels(as.factor(X$Order.number))){
  # getting the materials in each order
  materials_for_order = X[X$Order.number==order, "Materials"]
  if (length(materials_for_order)>1) {
    # finding each possible pair from the materials list
    all_pairs_in_order = combn(x=materials_for_order, m=2)
    # incrementing the cell at the line and column corresponding to each pair
    for(i in 1:ncol(all_pairs_in_order)){
      Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] =
Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] + 1

    Pairs[all_pairs_in_order[2, i], all_pairs_in_order[1, i]] =
Pairs[all_pairs_in_order[2, i], all_pairs_in_order[1, i]] + 1
      }
    }
else if(length(materials_for_order)== 1){
    Pairs[materials_for_order, 7] = Pairs[pairwise_for_gene, 7] + 1
  }
}

Upvotes: 0

Syncrossus
Syncrossus

Reputation: 626

The optimal solution in terms of memory space would be one row for each pair which would be 700*699 / 2. This problem is still relatively small and the simplicity of manipulating a 700*700 matrix is probably more valuable than the 700*701/2 cells you're saving, which would work out to 240kB with one byte per cell. It could be even less if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.

Here's how the code would look like:

First we want to create a dataframe with as many rows and columns as there are materials. Matrices are easier to create so we create one that we convert to a dataframe afterwards.

all_materials = levels(as.factor(X$Materials))
number_materials = length(all_materials)
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))

(Here, X is your dataset)

We then set the row names and column names to be able to access the rows and columns directly with the identifiers of the materials which are apparently not necessarily numbered from 1 to 700.

colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials

Then we iterate over the dataset

for(order in levels(as.factor(X$Order.number))){
  # getting the materials in each order
  materials_for_order = X[X$Order.number==order, "Materials"]
  if (length(materials_for_order)>1) {
    # finding each possible pair from the materials list
    all_pairs_in_order = combn(x=materials_for_order, m=2)
    # incrementing the cell at the line and column corresponding to each pair
    for(i in 1:ncol(all_pairs_in_order)){
      Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] = Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] + 1
    }
  }
}

At the end of the loop, the Pairs table should contain everything you need.

Upvotes: 0

chinsoon12
chinsoon12

Reputation: 25225

Here is a data.table solution

library(data.table)    
combis <- data.table(do.call(rbind, 
    DT[, if (.N > 1) list(combn(Materials, 2, simplify=FALSE)), by=Order.number]$V1
))
ans <- combis[, .N, by=.(V1, V2)]

#check results
setorder(ans, V1, V2)
ans

And a base method:

allComb <- by(DT, DT$Order.number, function(x) {
    if (nrow(x) > 1) {
        return(combn(x$Materials, 2, simplify=FALSE)))
    }
    NULL
}
materialsPairs <- as.data.frame(do.call(rbind, unlist(allComb, recursive=FALSE)))

#https://stackoverflow.com/a/18201245/1989480
res <- aggregate(cnt ~ ., data=transform(materialsPairs, cnt=1), length)

#check results
head(res[order(res$V1, res$V2),])

data:

library(data.table)
set.seed(0L)
M <- 30e3
nOrd <- 3000
DT <- data.table(Order.number=sample(nOrd, M, replace=TRUE), 
    Materials=sample(700, M, replace=TRUE))
setorder(DT, Order.number, Materials)

Upvotes: 0

Related Questions