Jason Mathews
Jason Mathews

Reputation: 275

Get the count of common elements from a column that is unique across two data frames?

I have two data frames among which there is a column that is common across both the data frames. I am trying to get the count of shared elements by other columns. Below is my input and output.

DF 1
        America AmericanSports
    1 American1         Soccer
    2 American1          Chess
    3 American1         Tennis
    4 American2          Chess
    5 American2         Tennis
    6 American3         Soccer
    7 American3         Tennis
DF2
        Russia RussianSports
    1 Russian1        Soccer
    2 Russian1        Tennis
    3 Russian2         Chess
    4 Russian2        Tennis
    5 Russian3         Chess
    6 Russian3        Tennis
    7 Russian3        Soccer

Output
     ANames   RNames Common_Sports_Count
1 American1 Russian1                   2
2 American1 Russian2                   2
3 American1 Russian3                   2
4 American2 Russian1                   1
5 American2 Russian2                   2
6 American2 Russian3                   3
7 American3 Russian1                   2
8 American3 Russian2                   1
9 American3 Russian3                   2

I tried inner joining the input data frame (using merge) but just not able to get the logic right. I know it is simple but just not able to think the right logic.

Upvotes: 0

Views: 57

Answers (2)

Parfait
Parfait

Reputation: 107567

Simply merge then aggregate:

mdf <- merge(df1, df2, by.x="AmericanSports", by.y="RussianSports")
aggdf <- aggregate(.~America+Russia, mdf, FUN=length)

names(aggdf)[ncol(aggdf)] <- "Common_Sports_Count"

aggdf
#     America   Russia Common_Sports_Count
# 1 American1 Russian1                   2
# 2 American2 Russian1                   1
# 3 American3 Russian1                   2
# 4 American1 Russian2                   2
# 5 American2 Russian2                   2
# 6 American3 Russian2                   1
# 7 American1 Russian3                   3
# 8 American2 Russian3                   2
# 9 American3 Russian3                   2

For faster aggregation on large datasets, consider the data.table package:

library(data.table)

dt <- data.table(mdf)
aggdt <- dt[, list(Common_Sports_Count=length(AmericanSports)), by='America,Russia']
aggdt

#      America   Russia Common_Sports_Count
# 1: American1 Russian2                   2
# 2: American1 Russian3                   3
# 3: American2 Russian2                   2
# 4: American2 Russian3                   2
# 5: American1 Russian1                   2
# 6: American3 Russian1                   2
# 7: American3 Russian3                   2
# 8: American2 Russian1                   1
# 9: American3 Russian2                   1

Upvotes: 2

BENY
BENY

Reputation: 323226

By using dplyr(PS:Drop column by using df[,'columns']=NULL)

library(dplyr)
df1=df1%>%group_by(America)%>%dplyr::summarise(Sports=list(AmericanSports))
df2=df2%>%group_by(Russia)%>%dplyr::summarise(Sports=list(RussianSports))
df=expand.grid(df1$America,df2$Russia)
df=merge(df,df1,by.x='Var1',by.y='America')
df=merge(df,df2,by.x='Var2',by.y='Russia')

df['Common_Sports_Count']=apply(df[,c('Sports.x','Sports.y')],1,function(x) length(intersect(x[1][[1]],x[2][[1]])))

df
      Var2      Var1              Sports.x              Sports.y Common_Sports_Count
1 Russian1 American1 Soccer, Chess, Tennis        Soccer, Tennis                   2
2 Russian1 American2         Chess, Tennis        Soccer, Tennis                   1
3 Russian1 American3        Soccer, Tennis        Soccer, Tennis                   2
4 Russian2 American1 Soccer, Chess, Tennis         Chess, Tennis                   2
5 Russian2 American3        Soccer, Tennis         Chess, Tennis                   1
6 Russian2 American2         Chess, Tennis         Chess, Tennis                   2
7 Russian3 American1 Soccer, Chess, Tennis Chess, Tennis, Soccer                   3
8 Russian3 American3        Soccer, Tennis Chess, Tennis, Soccer                   2
9 Russian3 American2         Chess, Tennis Chess, Tennis, Soccer                   2

EDIT for getting your result

df=df[,c('Var2','Var1','Common_Sports_Count')]

names(df)=c('Russia','America','Common_Sports_Count')

df[order(df$America),]

    Russia   America Common_Sports_Count
1 Russian1 American1                   2
4 Russian2 American1                   2
7 Russian3 American1                   3
2 Russian1 American2                   1
6 Russian2 American2                   2
9 Russian3 American2                   2
3 Russian1 American3                   2
5 Russian2 American3                   1
8 Russian3 American3                   2

Upvotes: 1

Related Questions