Satish Vadlamani
Satish Vadlamani

Reputation: 109

Find columns that have identical values

Problem statement: I actually want to eliminate from further analysis columns that have identical values in all cells. In order to do this, I want to find the columns that have identical values.

I wrote the following code which seems to be working for the dataframe test but not for the real dataframe stpo


    library("dplyr")
    library("purrr")
    test_unique <- function(x)
    {
      return(length(unique(x)))
    }
    
    test <-data.frame(c1 = c("a", "a"), c2 = c(NA, NA), c3 =  c(1,2), c4=c(NA, 4))
    # What I want to find out the columns that have the same value throughout
    res <- map(test[,c(names(test))], test_unique)
    res
    # But when I try to apply the same thing to the dataset below, it does not work. 
    # Not sure what the reason is. Is there a better way to do this? Perhaps using data.table? What am I doing wrong?
    res2 <- map(stpo[,c(names(stpo))], test_unique)
    res2

   
    I am not exactly sure how to put the result of dput. I am putting this below (this is the dataframe stpo)

structure(list(stlnr = c(1L, 2L, 3L, 3L, 3L, 3L, 4L), stlkn = c(1L, 
1L, 1L, 2L, 3L, 4L, 5L), stpoz = c(2L, 2L, 2L, 4L, 6L, 8L, 10L
), aennr = c(NA, NA, NA, NA, NA, NA, NA), vgknt = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L), idnrk = c("test_1", "test_1", "test_2", "test_3", 
"test_3", "test_1", "test_2"), pswrk = c(NA, NA, NA, NA, NA, 
NA, NA), meins = c("EA", "EA", "EA", "EA", "EA", "EA", "EA"), 
    menge = c(1, 14, 4, 4, 2, 2, 1), fmeng = c(NA, NA, NA, NA, 
    NA, NA, NA), ausch = c(0, 0, 0, 0, 0, 0, 0), avoau = c(0, 
    0, 0, 0, 0, 0, 0), netau = c(NA, NA, NA, NA, NA, NA, NA), 
    erskz = c(NA, NA, NA, NA, NA, NA, NA), rekri = c(NA, NA, 
    NA, NA, NA, NA, NA), rekrs = c(NA, NA, NA, NA, NA, NA, NA
    ), nlfzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), verti = c(NA, NA, 
    NA, NA, NA, NA, NA), alpos = c(NA, NA, NA, NA, NA, NA, NA
    ), ewahr = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), ekgrp = c(NA, NA, 
    NA, NA, NA, NA, NA), lifzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L
    ), lifnr = c(NA, NA, NA, NA, NA, NA, NA), roms1 = c(0, 0, 
    0, 0, 0, 0, 0), roms2 = c(0, 0, 0, 0, 0, 0, 0), roms3 = c(0, 
    0, 0, 0, 0, 0, 0), romen = c(0, 0, 0, 0, 0, 0, 0), rform = c(NA, 
    NA, NA, NA, NA, NA, NA), upskz = c(NA, NA, NA, NA, NA, NA, 
    NA), valkz = c(NA, NA, NA, NA, NA, NA, NA), matkl = c(NA, 
    NA, NA, NA, NA, NA, NA), webaz = c(0L, 0L, 0L, 0L, 0L, 0L, 
    0L), clobk = c(NA, NA, NA, NA, NA, NA, NA), lgort = c(NA, 
    NA, NA, NA, NA, NA, 14L), kzkup = c(NA, NA, NA, NA, NA, NA, 
    NA), dvnam = c(NA, NA, NA, NA, NA, NA, NA), dspst = c(NA, 
    NA, NA, NA, NA, NA, NA), alpst = c(NA, NA, NA, NA, NA, NA, 
    NA), alprf = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), alpgr = c(NA, 
    NA, NA, NA, NA, NA, NA), kstty = c(NA, NA, NA, NA, NA, NA, 
    NA), kstnr = c(NA, NA, NA, NA, NA, NA, NA), nlfzv = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L), nlfmv = c(NA, NA, NA, NA, NA, NA, 
    NA), idhis = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), idvar = c(NA, 
    NA, NA, NA, NA, NA, NA), itsob = c(NA, NA, NA, NA, NA, NA, 
    NA), cufactor = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), funcid = c(NA, 
    NA, NA, NA, NA, NA, NA)), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x0000022534c51ef0>)

Upvotes: 1

Views: 83

Answers (2)

Satish Vadlamani
Satish Vadlamani

Reputation: 109

Looks like I have taken a cue from what Arun wrote and modified the code like so:

res2 <- map(stpo[,1:length(names(stpo))], test_unique)

Upvotes: 0

akrun
akrun

Reputation: 887971

The issue is that we are subsetting on a data.table, rather than a data.frame. Here, we need with = FALSE (as mentioned in ?data.table

j - When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables.

stpo[,c(names(stpo))]
 [1] "stlnr"    "stlkn"    "stpoz"    "aennr"    "vgknt"    "idnrk"    "pswrk"    "meins"    "menge"    "fmeng"    "ausch"    "avoau"    "netau"    "erskz"   
[15] "rekri"    "rekrs"    "nlfzt"    "verti"    "alpos"    "ewahr"    "ekgrp"    "lifzt"    "lifnr"    "roms1"    "roms2"    "roms3"    "romen"    "rform"   
[29] "upskz"    "valkz"    "matkl"    "webaz"    "clobk"    "lgort"    "kzkup"    "dvnam"    "dspst"    "alpst"    "alprf"    "alpgr"    "kstty"    "kstnr"   
[43] "nlfzv"    "nlfmv"    "idhis"    "idvar"    "itsob"    "cufactor" "funcid"  

Now, check the output of

stpo[,c(names(stpo)), with = FALSE]
 stlnr stlkn stpoz aennr vgknt  idnrk pswrk meins menge fmeng ausch avoau netau erskz rekri rekrs nlfzt verti alpos ewahr ekgrp lifzt lifnr roms1 roms2
1:     1     1     2    NA     0 test_1    NA    EA     1    NA     0     0    NA    NA    NA    NA     0    NA    NA     0    NA     0    NA     0     0
2:     2     1     2    NA     0 test_1    NA    EA    14    NA     0     0    NA    NA    NA    NA     0    NA    NA     0    NA     0    NA     0     0
3:     3     1     2    NA     0 test_2    NA    EA     4    NA     0     0    NA    NA    NA    NA     0    NA    NA     0    NA     0    NA     0     0
4:     3     2     4    NA     0 test_3    NA    EA     4    NA     0     0    NA    NA    NA    NA     0    NA    NA     0    NA     0    NA     0     0
5:     3     3     6    NA     0 test_3    NA    EA     2    NA     0     0    NA    NA    NA    NA     0    NA    NA     0    NA     0    NA     0     0
6:     3     4     8    NA     0 test_1    NA    EA     2    NA     0     0    NA    NA    NA    NA     0    NA    NA     0    NA     0    NA     0     0
7:     4     5    10    NA     0 test_2    NA    EA     1    NA     0     0    NA    NA    NA    NA     0    NA    NA     0    NA     0    NA     0     0
   roms3 romen rform upskz valkz matkl webaz clobk lgort kzkup dvnam dspst alpst alprf alpgr kstty kstnr nlfzv nlfmv idhis idvar itsob cufactor funcid
1:     0     0    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA     0    NA     0    NA    NA        0     NA
2:     0     0    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA     0    NA     0    NA    NA        0     NA
3:     0     0    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA     0    NA     0    NA    NA        0     NA
4:     0     0    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA     0    NA     0    NA    NA        0     NA
5:     0     0    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA     0    NA     0    NA    NA        0     NA
6:     0     0    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA     0    NA     0    NA    NA        0     NA
7:     0     0    NA    NA    NA    NA     0    NA    14    NA    NA    NA    NA     0    NA    NA    NA     0    NA     0    NA 

Also, there is no need to do any subsetting if the whole columns are used, i.e. simply do

purrr::map(stpo, test_unique)

-output

$stlnr
[1] 4

$stlkn
[1] 5

$stpoz
[1] 5
...
...

Regarding the use of

stpo[,1:length(names(stpo))]

It seems to be a bug or a hackish way of dealing things instead of the standard option


If we want to eliminate columns having a single value, use var (assuming all numeric columns)

Filter(var, stpo)
stlnr stlkn stpoz menge
1:     1     1     2     1
2:     2     1     2    14
3:     3     1     2     4
4:     3     2     4     4
5:     3     3     6     2
6:     3     4     8     2
7:     4     5    10     1

Or change the function to return a logical output (it will also check for other type columns)

f1 <- function(x) length(unique(x)) > 1
Filter(f1, stpo)

-output

    stlnr stlkn stpoz  idnrk menge lgort
1:     1     1     2 test_1     1    NA
2:     2     1     2 test_1    14    NA
3:     3     1     2 test_2     4    NA
4:     3     2     4 test_3     4    NA
5:     3     3     6 test_3     2    NA
6:     3     4     8 test_1     2    NA
7:     4     5    10 test_2     1    14

Or use the data.table way of subsetting the columns

 stpo[, .SD, .SDcols = f1]
   stlnr stlkn stpoz  idnrk menge lgort
1:     1     1     2 test_1     1    NA
2:     2     1     2 test_1    14    NA
3:     3     1     2 test_2     4    NA
4:     3     2     4 test_3     4    NA
5:     3     3     6 test_3     2    NA
6:     3     4     8 test_1     2    NA
7:     4     5    10 test_2     1    14

Upvotes: 2

Related Questions