Jibril

Reputation: 1037

Matching Data from Different columns / dataframes - Working in R

Here is some sample data

    Dataset A
    id       name      reasonforlogin
    123      Tom       work
    246      Timmy     work
    789      Mark      play

    Dataset B
    id       name      reasonforlogin
    789      Mark      work
    313      Sasha     interview
    000      Meryl     interview
    987      Dara      play
    789      Mark      play
    246      Timmy     work

Two datasets. Same columns. Uneven number of rows.
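To make the question reproducible, the two tables could be built roughly like this. This is only a sketch: the object names `A` and `B` and the choice to store `id` as character (so that `000` is not collapsed to `0`) are assumptions, not something stated in the tables themselves.

```r
# Sample data from the two tables above.
# Assumption: ids are stored as character so that "000" keeps its leading zeros.
A <- data.frame(
  id = c("123", "246", "789"),
  name = c("Tom", "Timmy", "Mark"),
  reasonforlogin = c("work", "work", "play"),
  stringsAsFactors = FALSE
)

B <- data.frame(
  id = c("789", "313", "000", "987", "789", "246"),
  name = c("Mark", "Sasha", "Meryl", "Dara", "Mark", "Timmy"),
  reasonforlogin = c("work", "interview", "interview", "play", "play", "work"),
  stringsAsFactors = FALSE
)
```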

I want to be able to say something like

1) "I want all of the id numbers that appear in both datasetA and datasetB"

or

2) "I want to know how many times any one ID logs in on a given day, say day 2."

So the answers would be:

1) A list like

    [246, 789]

2) A data.frame with a "header" of ids, and then a "row" of their login counts.

    123, 246, 789, 313, 000, 987

    0, 1, 2, 1, 1, 1

It seems easy, but I think it's non-trivial to do this quickly with large data. Originally I planned on using nested loops, but I'm sure there is a term for this kind of comparison, and likely packages that already do similar things.

Upvotes: 0

Views: 46

Answers (3)

kliron

Reputation: 4673

You need which and table.

1) Find which ids are in both data.frames

    common_ids <- unique(df1[which(df1$id %in% df2$id), "id"])

Using intersect, as in the other answers, is much more elegant in this simple case. However, which is more flexible when the comparison you need is more complicated than simple equality, so it is worth knowing.
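As a rough illustration of that flexibility, here is a sketch where the match has an extra condition. The condition itself (only ids whose df1 login reason is "work") and the names `keep` and `work_ids` are made up for the example, and `df1` / `df2` are cut-down copies of the question's data.

```r
# Cut-down copies of the question's data (ids kept as character).
df1 <- data.frame(
  id = c("123", "246", "789"),
  reasonforlogin = c("work", "work", "play"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("789", "313", "000", "987", "789", "246"),
  stringsAsFactors = FALSE
)

# Hypothetical compound condition: id appears in df2 AND the df1 reason is "work".
# which() turns the logical vector into row positions and drops any NA results.
keep <- which(df1$id %in% df2$id & df1$reasonforlogin == "work")

work_ids <- unique(df1[keep, "id"])
work_ids
```

With this data only "246" satisfies both parts of the condition; intersect alone cannot express the second part.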

2) Find how many times any ID logs in

    table(df1$id)

Upvotes: 0

Rich Scriven

Reputation: 99371

If we have A as the first data set and B the second, and id as a character column in both so as to keep 000 from being printed as 0, we can do ...

ids common to both data sets:

    intersect(A$id, B$id)
    # [1] "246" "789"

Times an id logged in on the second day (B), including those that were not logged in at all:

    table(factor(B$id, levels = unique(c(A$id, B$id))))
    # 123 246 789 313 000 987
    #   0   1   2   1   1   1
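Since the question asks for a data.frame rather than a table, one possible follow-up is to convert the result. This is a sketch only: the column names `id` and `logins` are chosen here for illustration, and `A_id` / `B_id` stand in for `A$id` and `B$id` so the snippet runs on its own.

```r
# Stand-ins for A$id and B$id from the question (character, so "000" is kept).
A_id <- c("123", "246", "789")
B_id <- c("789", "313", "000", "987", "789", "246")

# Count logins in B over every id seen in either data set, zeros included.
counts <- table(factor(B_id, levels = unique(c(A_id, B_id))))

# Long format: one row per id. Column names are illustrative.
login_df <- setNames(
  as.data.frame(counts, stringsAsFactors = FALSE),
  c("id", "logins")
)
login_df
```

This gives one row per id, which is usually easier to work with than a single wide row with ids as headers.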

Upvotes: 3

bramtayl

Reputation: 4024

You can do both with dplyr

1

    library(dplyr)

    A %>%
      select(id) %>%   # the original was missing this pipe
      inner_join(B %>% select(id), by = "id") %>%
      distinct()

2

    B %>% count(id)

Upvotes: 0
