goldisfine
goldisfine

Reputation: 4850

Arrange dataframe for pairwise correlations

I am working with data in the following form:

Country   Player    Goals
"USA"     "Tim"     0
"USA"     "Tim"     0
"USA"     "Dempsey" 3
"USA"     "Dempsey" 5
"Brasil"  "Neymar"  6
"Brasil"  "Neymar"  2
"Brasil"  "Hulk"    5
"Brasil"  "Luiz"    2
"England" "Rooney"  4
"England" "Stewart" 2

Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:

Player_1 Player_2
0        8 # Tim     Dempsey
8        5 # Neymar  Hulk
8        2 # Neymar  Luiz
5        2 # Hulk    Luiz
4        2 # Rooney  Stewart

(You can ignore the comments, they are there simply to clarify what each row contains).

How would I do this?

table(df$player) 

gets me the number of goals per player, but then how to I generate these pairwise combinations?

Upvotes: 2

Views: 252

Answers (2)

J.R.
J.R.

Reputation: 3878

There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:

A<-aggregate(df$Goals~df$Player+df$Country,data=df,sum)
players_in_c<-table(A[,2])
dat<-NULL
for(i in levels(df$Country)) {
    count<-players_in_c[i]
    pair<-combn(count,m=2)
    B<-A[A[,2]==i,]
    dat<-rbind(dat, cbind(B[pair[1,],],B[pair[2,],]) )
}
dat

> dat
    df$Player df$Country df$Goals df$Player df$Country df$Goals
1        Hulk     Brasil        5      Luiz     Brasil        2
1.1      Hulk     Brasil        5    Neymar     Brasil        8
2        Luiz     Brasil        2    Neymar     Brasil        8
4      Rooney    England        4   Stewart    England        2
6     Dempsey        USA        8       Tim        USA        0

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 145755

This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.

library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals)) 

> df
Source: local data frame [7 x 3]
Groups: Player

   Player Country Goals
1 Dempsey     USA     8
2    Hulk  Brasil     5
3    Luiz  Brasil     2
4  Neymar  Brasil     8
5  Rooney England     4
6 Stewart England     2
7     Tim     USA     0

Then, using good old merge, we join it to itself based on country, and then so we don't get each row twice (Dempsey, Tim and Tim, Dempsey---not to mention Dempsey, Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.

df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))

> df2
   Country Player.x Goals.x Player.y Goals.y
2   Brasil     Hulk       5     Luiz       2
3   Brasil     Hulk       5   Neymar       8
6   Brasil     Luiz       2   Neymar       8
11 England   Rooney       4  Stewart       2
15     USA  Dempsey       8      Tim       0

The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.

Upvotes: 3

Related Questions