Reputation: 67
I've got 2 data frames that contain the same columns. One is called newGames, the other is called existingGames. What I would like to do is check comapare the newGames game_pk and sv_id columns between the two data frames. If the game_pk and sv_id exist in both table I would like to remove that row from the newGames data frame.
N.game_pk N.sv_id E.game_pk E.sv_id
1 440697 123456_789012 NA NA
2 440698 123456_789012 440698 NA
3 440699 123456_789012 440699 123456_789012
The e. stands for the existingGames data frame and N. stands for the newGames data frame. What I would like to have in the end is a newGames data frame that shows all the columns where e.game_pk AND e.sv_id (WHOLE GAME) or e.sv_id (single pitch from game) do not exist in the existingGames data frame. Sv_id is a time stamp so it is possible to have the same value for different game_pk's, so "combining" the fields makes the row unique.
this works for missing games (440697)
newGames2 <- newGames[!(newGames$game_pk %in% existingGames$game_pk),]
but I tried
newGames2 <- newGames[!(newGames$game_pk %in% existingGames$game_pk) | (!(newGames$game_pk %in% existingGames$game_pk) & !(newGames$sv_id %in% existingGames$sv_id)),]
and it just selects the full data frame instead of the single missing pitch.
Upvotes: 0
Views: 87
Reputation: 16277
Here's how to do that with anti_join
from dplyr
:
new_game <- read.table(text="game_pk sv_id
440697 123456_789012
440698 123456_789012
440699 123456_789012",header=TRUE,stringsAsFactors=FALSE)
existing_game <- read.table(text="game_pk sv_id
NA NA
440698 NA
440699 123456_789012",header=TRUE,stringsAsFactors=FALSE)
library(dplyr)
anti_join(new_game,existing_game)
game_pk sv_id
1 440698 123456_789012
2 440697 123456_789012
Upvotes: 2