Christopher

Reputation: 3

Get intersection in data.frame of some variables without omitting others

I have a huge dataframe (15 million rows), e.g.

    data = data.frame(
       human = c(1,0,0,1,1,0,0,0,0,1,1),
       hair = c(3,1,5,3,1,1,3,4,4,5,5),
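       # the 11th eye_colour value is missing in the original post; 3 is assumed here
       # (any value except 2 reproduces the result shown below)
       eye_colour = c(1,4,2,1,4,3,1,3,3,3,3),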
       fuel = c(1,2,3,3,4,7,5,6,1,4,6)
    )

and I want to find the combinations of hair and eye_colour that occur for both human == 0 and human == 1, keep only the rows with such a combination, and label each combination with a cyclon_individual id. For my application, one cyclon_individual is somebody who is recorded at least once with human == 1 and at least once with human == 0 under the same hair and eye_colour coding, i.e. I expect the following result:

    cyclon_individual human hair eye_colour fuel
    1                 1     3    1          1
    1                 1     3    1          3
    1                 0     3    1          5
    2                 0     1    4          2
    2                 1     1    4          4

I think I have taken an awkward route, and I haven't yet found a clever way to code cyclon_individual with dplyr:

    require('dplyr')
    hum = subset(data, human == 1)
    non_hum = subset(data, human == 0)
    feature_intersection = c("hair", "eye_colour")

    cyclon = intersect(hum[,feature_intersection],non_hum[,feature_intersection])
    cyclon_data = cyclon %>%
                    rowwise() %>%
                    do(filter(data,hair==.$hair,eye_colour==.$eye_colour))

So is there a more direct way to get cyclon_data, since the current code would take at least 26 hours? And is there a clever way to add the variable cyclon_individual without looping over all rows of cyclon?

Upvotes: 0

Views: 290

Answers (2)

akrun

Reputation: 887501

We can use n_distinct from dplyr:

    library(dplyr)
    data %>%
      group_by(hair, eye_colour) %>%
      filter(n_distinct(human) > 1)
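
If you would rather stay close to the intersect-based approach from the question, the row-by-row `do(filter(...))` step can probably be replaced by a single join. The following is only a sketch under a few assumptions: the sample data is retyped from the question (with an assumed 11th eye_colour value, since the original vector is one element short), and `hum_keys` / `non_hum_keys` are names made up for illustration.

```r
library(dplyr)

# Sample data from the question; the 11th eye_colour value (3) is an assumption
data <- data.frame(
  human      = c(1,0,0,1,1,0,0,0,0,1,1),
  hair       = c(3,1,5,3,1,1,3,4,4,5,5),
  eye_colour = c(1,4,2,1,4,3,1,3,3,3,3),
  fuel       = c(1,2,3,3,4,7,5,6,1,4,6)
)

feature_intersection <- c("hair", "eye_colour")

# Distinct feature combinations seen for human == 1 and for human == 0
hum_keys <- data %>%
  filter(human == 1) %>%
  select(all_of(feature_intersection)) %>%
  distinct()
non_hum_keys <- data %>%
  filter(human == 0) %>%
  select(all_of(feature_intersection)) %>%
  distinct()

# Combinations present in both subsets (joining on all columns acts as an intersection)
cyclon <- inner_join(hum_keys, non_hum_keys, by = feature_intersection)

# Keep every row of data whose combination appears in cyclon
cyclon_data <- semi_join(data, cyclon, by = feature_intersection)
```

`semi_join()` only filters `data` and adds no columns, so the result has the same columns as the input; whether it is faster than the grouped filter on 15 million rows is something you would have to measure.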

Upvotes: 0

Sotos

Reputation: 51592

You can simply group by hair and eye_colour and keep the groups where human has both 0 and 1, i.e.

    library(dplyr)

    data %>%
      group_by(hair, eye_colour) %>%
      filter(length(unique(human)) > 1)

which gives:

    # A tibble: 5 x 4
    # Groups:   hair, eye_colour [2]
      human  hair eye_colour  fuel
      <dbl> <dbl>      <dbl> <dbl>
    1     1     3          1     1
    2     0     1          4     2
    3     1     3          1     3
    4     1     1          4     4
    5     0     3          1     5
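
The question also asks for a cyclon_individual column, which the filter alone does not produce. One possible way to add it, assuming dplyr 1.0.0 or later (for `cur_group_id()` and `relocate()`), is to number the groups that survive the filter. This is a sketch on the sample data, with the missing 11th eye_colour value assumed to be 3:

```r
library(dplyr)

# Sample data from the question; the 11th eye_colour value (3) is an assumption
data <- data.frame(
  human      = c(1,0,0,1,1,0,0,0,0,1,1),
  hair       = c(3,1,5,3,1,1,3,4,4,5,5),
  eye_colour = c(1,4,2,1,4,3,1,3,3,3,3),
  fuel       = c(1,2,3,3,4,7,5,6,1,4,6)
)

cyclon_data <- data %>%
  group_by(hair, eye_colour) %>%
  filter(n_distinct(human) > 1) %>%           # keep groups containing both human values
  mutate(cyclon_individual = cur_group_id()) %>%  # one integer id per remaining group
  ungroup() %>%
  relocate(cyclon_individual) %>%             # move the id to the first column
  arrange(cyclon_individual)
```

Note that the ids follow the sorted order of the grouping keys, not the order of first appearance, so they identify the same groups as in the question's expected output but not necessarily with the same numbers.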

Upvotes: 2
