Reputation: 275
I have two data frames among which there is a column that is common across both the data frames. I am trying to get the count of shared elements by other columns. Below is my input and output.
DF 1
America AmericanSports
1 American1 Soccer
2 American1 Chess
3 American1 Tennis
4 American2 Chess
5 American2 Tennis
6 American3 Soccer
7 American3 Tennis
DF2
Russia RussianSports
1 Russian1 Soccer
2 Russian1 Tennis
3 Russian2 Chess
4 Russian2 Tennis
5 Russian3 Chess
6 Russian3 Tennis
7 Russian3 Soccer
Output
ANames RNames Common_Sports_Count
1 American1 Russian1 2
2 American1 Russian2 2
3 American1 Russian3 2
4 American2 Russian1 1
5 American2 Russian2 2
6 American2 Russian3 3
7 American3 Russian1 2
8 American3 Russian2 1
9 American3 Russian3 2
I tried inner joining the input data frame (using merge
) but just not able to get the logic right. I know it is simple but just not able to think the right logic.
Upvotes: 0
Views: 57
Reputation: 107567
Simply merge
then aggregate
:
mdf <- merge(df1, df2, by.x="AmericanSports", by.y="RussianSports")
aggdf <- aggregate(.~America+Russia, mdf, FUN=length)
names(aggdf)[ncol(aggdf)] <- "Common_Sports_Count"
aggdf
# America Russia Common_Sports_Count
# 1 American1 Russian1 2
# 2 American2 Russian1 1
# 3 American3 Russian1 2
# 4 American1 Russian2 2
# 5 American2 Russian2 2
# 6 American3 Russian2 1
# 7 American1 Russian3 3
# 8 American2 Russian3 2
# 9 American3 Russian3 2
For faster aggregation on large datasets, consider the data.table
package:
library(data.table)
dt <- data.table(mdf)
aggdt <- dt[, list(Common_Sports_Count=length(AmericanSports)), by='America,Russia']
aggdt
# America Russia Common_Sports_Count
# 1: American1 Russian2 2
# 2: American1 Russian3 3
# 3: American2 Russian2 2
# 4: American2 Russian3 2
# 5: American1 Russian1 2
# 6: American3 Russian1 2
# 7: American3 Russian3 2
# 8: American2 Russian1 1
# 9: American3 Russian2 1
Upvotes: 2
Reputation: 323226
By using dplyr
(PS:Drop column by using df[,'columns']=NULL
)
library(dplyr)
df1=df1%>%group_by(America)%>%dplyr::summarise(Sports=list(AmericanSports))
df2=df2%>%group_by(Russia)%>%dplyr::summarise(Sports=list(RussianSports))
df=expand.grid(df1$America,df2$Russia)
df=merge(df,df1,by.x='Var1',by.y='America')
df=merge(df,df2,by.x='Var2',by.y='Russia')
df['Common_Sports_Count']=apply(df[,c('Sports.x','Sports.y')],1,function(x) length(intersect(x[1][[1]],x[2][[1]])))
df
Var2 Var1 Sports.x Sports.y Common_Sports_Count
1 Russian1 American1 Soccer, Chess, Tennis Soccer, Tennis 2
2 Russian1 American2 Chess, Tennis Soccer, Tennis 1
3 Russian1 American3 Soccer, Tennis Soccer, Tennis 2
4 Russian2 American1 Soccer, Chess, Tennis Chess, Tennis 2
5 Russian2 American3 Soccer, Tennis Chess, Tennis 1
6 Russian2 American2 Chess, Tennis Chess, Tennis 2
7 Russian3 American1 Soccer, Chess, Tennis Chess, Tennis, Soccer 3
8 Russian3 American3 Soccer, Tennis Chess, Tennis, Soccer 2
9 Russian3 American2 Chess, Tennis Chess, Tennis, Soccer 2
EDIT for getting your result
df=df[,c('Var2','Var1','Common_Sports_Count')]
names(df)=c('Russia','America','Common_Sports_Count')
df[order(df$America),]
Russia America Common_Sports_Count
1 Russian1 American1 2
4 Russian2 American1 2
7 Russian3 American1 3
2 Russian1 American2 1
6 Russian2 American2 2
9 Russian3 American2 2
3 Russian1 American3 2
5 Russian2 American3 1
8 Russian3 American3 2
Upvotes: 1