peer

Reputation: 277

Reduce data without loss of information

Suppose I have two data tables

df1
       x   y    f(x,y)
1      a   A    3
2      b   E    4
3      a   E    5
4      b   A    2

and

df2
       x   y    f(x,y)
1      a   A    4
2      b   E    4
3      a   E    4
4      b   A    2

If we interpret the columns x and y as influences on some result, then in the second example (df2) the outcome is independent of column y for x = a. For a report, I would like to drop the y values that don't influence the outcome, so I would like to create df2_out instead of df2 (to avoid large tables):

df2_out
       x   y    f(x,y)
1      a   -    4
2      b   E    4
3      b   A    2

whereas df1 should stay as it is, since both x and y influence the outcome:

df1_out
       x   y    f(x,y)
1      a   A    3
2      b   E    4
3      a   E    5
4      b   A    2

How can I achieve this in R? Is there a better way to print the data table?

Upvotes: 0

Views: 125

Answers (1)

andrew_reece

Reputation: 21274

Your expected output indicates you are only interested in adjusting cases where the outcome of f() is independent of y. You can use dplyr methods to do this:

library(dplyr)

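find_independent <- function(data) {
  data %>%
    # attach the number of rows sharing each (x, f) combination
    inner_join(data %>% 
                 group_by(x, f) %>% 
                 count(), 
               by=c("x", "f")) %>% 
    # n == 2: both y values of this x give the same f, so y is irrelevant
    mutate(y = if_else(n == 2, "_", y)) %>%
    # drop the helper column so the result keeps only x, y, f
    select(-n) %>%
    distinct()
}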

find_independent(df1)
  x y f
1 a A 3
2 b E 4
3 a E 5
4 b A 2

find_independent(df2)
  x y f
1 a _ 4
2 b E 4
3 b A 2

Explanation (using df2 as an example):

  • First, group_by x and f and count the number of occurrences.

    df2 %>% group_by(x, f) %>% count()
    # A tibble: 3 x 3
    # Groups:   x, f [3]
      x         f     n
      <chr> <int> <int>
    1 a         4     2
    2 b         2     1
    3 b         4     1
    
  • Merge this count back to the original data frame, and for the rows where n == 2 (both y values of that x give the same f), change the value of y to _.

  • Drop duplicate rows (which will be the rows where y has no effect on f) using distinct().
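The n == 2 check relies on every x having exactly two y values, as in the example data. If y can take more levels, one possible generalization is to test, per x, whether f takes a single value. The sketch below assumes character columns named x, y and f as above; the function name collapse_independent, the placeholder argument and the three-level data frame df3 are made up for illustration.

```r
library(dplyr)

# Hypothetical generalization: within each x, replace y by a placeholder
# when f takes only one value across all rows of that x.
collapse_independent <- function(data, placeholder = "-") {
  data %>%
    group_by(x) %>%
    # n() > 1 keeps single-row groups untouched
    mutate(y = if (n() > 1 && n_distinct(f) == 1) placeholder else y) %>%
    ungroup() %>%
    distinct()
}

# df2 from the question
df2 <- data.frame(x = c("a", "b", "a", "b"),
                  y = c("A", "E", "E", "A"),
                  f = c(4L, 4L, 4L, 2L),
                  stringsAsFactors = FALSE)

# Made-up data with three y levels: f depends on y for x = a only
df3 <- data.frame(x = rep(c("a", "b"), each = 3),
                  y = rep(c("A", "E", "G"), times = 2),
                  f = c(4L, 4L, 5L, 2L, 2L, 2L),
                  stringsAsFactors = FALSE)

res2 <- collapse_independent(df2)
res3 <- collapse_independent(df3)
```

For df3, the three x = a rows stay as they are because f is not constant there, while the x = b rows collapse into a single row. Note that the result is a tibble rather than a plain data frame; as.data.frame() converts it back if needed.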

Data:

df1 <- structure(list(x = c("a", "b", "a", "b"), y = c("A", "E", "E", 
"A"), f = c(3L, 4L, 5L, 2L)), class = "data.frame", row.names = c(NA, 
-4L))
df2 <- structure(list(x = c("a", "b", "a", "b"), y = c("A", "E", "E", 
"A"), f = c(4L, 4L, 4L, 2L)), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 2
