Reputation: 1769
I have the following dataframe:
> str(database)
'data.frame': 8547287 obs. of 4 variables:
$ cited_id : num 4.06e+08 5.41e+07 5.31e+07 5.04e+07 3.79e+08 ...
$ cited_pub_year : num 2014 1989 2002 2002 2015 ...
$ citing_id : num 3.34e+08 3.37e+08 4.06e+08 4.19e+08 4.25e+08 ...
$ citing_pub_year: num 2011 2011 2013 2014 2014 ...
The variables cited_id and citing_id contain the IDs of the objects from which this database has been obtained.
This is an example of the dataframe:
cited_id cited_pub_year citing_id citing_pub_year
1 405821349 2014 419185055 2011
2 405821349 1989 336621202 2011
3 53148996 2002 406314162 2013
4 53148996 2002 419185055 2014
5 379369076 2015 424901495 2014
6 53148996 2011 441055669 2015
7 405821349 2014 447519383 2015
8 405821349 2015 469644221 2016
9 329268142 2014 470861263 2016
10 45433355 2008 55422577 2008
For example, the ID 405821349 has been cited by 419185055, 336621202, 447519383 and 469644221. For each pair of IDs, I would like to calculate the intersection of their citing IDs. The quantity Pj.k below is the length of that intersection. I tried the following code:
total_id <- c(database$cited_id, database$citing_id)
total_id <- unique(total_id)
df <- data.frame(data_k = character(), data_j = character(), Pj.k = numeric(),
                 stringsAsFactors = F)
for (k in 1:(length(total_id) - 1)) {
  data_k <- total_id[k]
  citing_data_k <- database[database$cited_id == data_k, ]
  for (j in (k + 1):length(total_id)) {
    data_j <- total_id[j]
    citing_data_j <- database[database$cited_id == data_j, ]
    Pj.k <- length(intersect(citing_data_j$citing_id, citing_data_k$citing_id))
    dfxx = data.frame(data_k = data_k, data_j = data_j, Pj.k = Pj.k,
                      stringsAsFactors = F)
    df <- rbind(df, dfxx)
  }
}
Anyway, it takes too long! How could I speed it up?
Upvotes: 0
Views: 62
Reputation: 17001
Using xtabs, tcrossprod and sparse matrices:
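library(Matrix)

# Sparse incidence matrix: rows are cited_id, columns are citing_id
m1 <- xtabs(~ cited_id + citing_id, data = database, sparse = TRUE)

# tcrossprod(m1) counts the citing IDs shared by each pair of cited IDs;
# triu(k = 1) keeps each unordered pair once and drops the diagonal
m2 <- as(triu(tcrossprod(m1), k = 1), "TsparseMatrix")

# The i and j slots are 0-based, hence the + 1L
df <- data.frame(
  data_k = rownames(m1)[m2@i + 1L],
  data_j = rownames(m1)[m2@j + 1L],
  Pj.k = m2@x,
  stringsAsFactors = FALSE
)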
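Why this works: if m1 is a 0/1 incidence matrix (rows are cited IDs, columns are citing IDs), entry (j, k) of m1 %*% t(m1) sums m1[j, c] * m1[k, c] over all citing IDs c, which is the number of citing IDs that j and k share. Below is a minimal check on the ten sample rows from the question. It assumes there are no duplicated (cited_id, citing_id) rows, since duplicates would inflate the counts; the names co, by_hand, i and j are only illustrative.

```r
library(Matrix)

# The ten sample rows from the question (ID columns only)
database <- data.frame(
  cited_id  = c(405821349, 405821349, 53148996, 53148996, 379369076,
                53148996, 405821349, 405821349, 329268142, 45433355),
  citing_id = c(419185055, 336621202, 406314162, 419185055, 424901495,
                441055669, 447519383, 469644221, 470861263, 55422577)
)

m1 <- xtabs(~ cited_id + citing_id, data = database, sparse = TRUE)
co <- tcrossprod(m1)  # cited x cited matrix of shared citing-ID counts

# Look up the pair (405821349, 53148996) by position
i <- match("405821349", rownames(m1))
j <- match("53148996", rownames(m1))
shared <- co[i, j]

# The same quantity computed directly, as in the question's loop
by_hand <- length(intersect(
  database$citing_id[database$cited_id == 405821349],
  database$citing_id[database$cited_id == 53148996]
))
```

Both shared and by_hand should equal 1 here, because 419185055 is the only ID citing both. Note that this approach only reports pairs of IDs that occur in cited_id and have a non-empty intersection; pairs with Pj.k = 0 are simply absent from the result.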
Upvotes: 1
Reputation: 17734
Inspired by the answers to "Count combinations of categorical variables, regardless of order, in R?", you can count pairs:
database <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"cited_id cited_pub_year citing_id citing_pub_year
1 405821349 2014 419185055 2011
2 405821349 1989 336621202 2011
3 53148996 2002 406314162 2013
4 53148996 2002 419185055 2014
5 379369076 2015 424901495 2014
6 53148996 2011 441055669 2015
7 405821349 2014 447519383 2015
8 405821349 2015 469644221 2016
9 329268142 2014 470861263 2016
10 45433355 2008 55422577 2008")
database |>
  dplyr::count(pairs = paste(pmin(cited_id, citing_id),
                             pmax(cited_id, citing_id)))
#> pairs n
#> 1 329268142 470861263 1
#> 2 336621202 405821349 1
#> 3 379369076 424901495 1
#> 4 405821349 419185055 1
#> 5 405821349 447519383 1
#> 6 405821349 469644221 1
#> 7 45433355 55422577 1
#> 8 53148996 406314162 1
#> 9 53148996 419185055 1
#> 10 53148996 441055669 1
Depending on what you actually need, you might find with(database, table(cited_id = cited_id, citing_id = citing_id)) useful too.
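As a rough sketch of one way that table could be used for the question's Pj.k (this is an assumption about the intended use, not something stated above): treat it as a dense 0/1 incidence matrix and multiply it by its transpose. A dense table is only practical for a small subset of the data; with millions of distinct IDs it would not fit in memory. The names inc, co, idx and pairs are illustrative.

```r
# The ten sample rows from the question (ID columns only)
database <- data.frame(
  cited_id  = c(405821349, 405821349, 53148996, 53148996, 379369076,
                53148996, 405821349, 405821349, 329268142, 45433355),
  citing_id = c(419185055, 336621202, 406314162, 419185055, 424901495,
                441055669, 447519383, 469644221, 470861263, 55422577)
)

tab <- with(database, table(cited_id = cited_id, citing_id = citing_id))

# Collapse counts to 0/1 so duplicated rows do not inflate the result
inc <- (unclass(tab) > 0) * 1L

# Entry (j, k) is the number of citing IDs shared by cited IDs j and k
co <- tcrossprod(inc)

# Keep each unordered pair once, and only pairs with a non-empty intersection
idx <- which(upper.tri(co) & co > 0, arr.ind = TRUE)
pairs <- data.frame(
  data_k = rownames(co)[idx[, "row"]],
  data_j = rownames(co)[idx[, "col"]],
  Pj.k = co[idx],
  stringsAsFactors = FALSE
)
```

On the sample data, pairs should contain a single row for 53148996 and 405821349 with Pj.k equal to 1.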
Upvotes: 1