Reputation: 109
Problem statement: I actually want to eliminate from further analysis columns that have identical values in all cells. In order to do this, I want to find the columns that have identical values.
I wrote the following code which seems to be working for the dataframe test but not for the real dataframe stpo
library("dplyr")
library("purrr")
test_unique <- function(x)
{
return(length(unique(x)))
}
test <-data.frame(c1 = c("a", "a"), c2 = c(NA, NA), c3 = c(1,2), c4=c(NA, 4))
# What I want to find out the columns that have the same value throughout
res <- map(test[,c(names(test))], test_unique)
res
# But when I try to apply the same thing to the dataset below, it does not work.
# Not sure what the reason is. Is there a better way to do this? Perhaps using data.table? What am I doing wrong?
res2 <- map(stpo[,c(names(stpo))], test_unique)
res2
I am not exactly sure how to put the result of dput. I am putting this below (this is the dataframe stpo)
structure(list(stlnr = c(1L, 2L, 3L, 3L, 3L, 3L, 4L), stlkn = c(1L,
1L, 1L, 2L, 3L, 4L, 5L), stpoz = c(2L, 2L, 2L, 4L, 6L, 8L, 10L
), aennr = c(NA, NA, NA, NA, NA, NA, NA), vgknt = c(0L, 0L, 0L,
0L, 0L, 0L, 0L), idnrk = c("test_1", "test_1", "test_2", "test_3",
"test_3", "test_1", "test_2"), pswrk = c(NA, NA, NA, NA, NA,
NA, NA), meins = c("EA", "EA", "EA", "EA", "EA", "EA", "EA"),
menge = c(1, 14, 4, 4, 2, 2, 1), fmeng = c(NA, NA, NA, NA,
NA, NA, NA), ausch = c(0, 0, 0, 0, 0, 0, 0), avoau = c(0,
0, 0, 0, 0, 0, 0), netau = c(NA, NA, NA, NA, NA, NA, NA),
erskz = c(NA, NA, NA, NA, NA, NA, NA), rekri = c(NA, NA,
NA, NA, NA, NA, NA), rekrs = c(NA, NA, NA, NA, NA, NA, NA
), nlfzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), verti = c(NA, NA,
NA, NA, NA, NA, NA), alpos = c(NA, NA, NA, NA, NA, NA, NA
), ewahr = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), ekgrp = c(NA, NA,
NA, NA, NA, NA, NA), lifzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L
), lifnr = c(NA, NA, NA, NA, NA, NA, NA), roms1 = c(0, 0,
0, 0, 0, 0, 0), roms2 = c(0, 0, 0, 0, 0, 0, 0), roms3 = c(0,
0, 0, 0, 0, 0, 0), romen = c(0, 0, 0, 0, 0, 0, 0), rform = c(NA,
NA, NA, NA, NA, NA, NA), upskz = c(NA, NA, NA, NA, NA, NA,
NA), valkz = c(NA, NA, NA, NA, NA, NA, NA), matkl = c(NA,
NA, NA, NA, NA, NA, NA), webaz = c(0L, 0L, 0L, 0L, 0L, 0L,
0L), clobk = c(NA, NA, NA, NA, NA, NA, NA), lgort = c(NA,
NA, NA, NA, NA, NA, 14L), kzkup = c(NA, NA, NA, NA, NA, NA,
NA), dvnam = c(NA, NA, NA, NA, NA, NA, NA), dspst = c(NA,
NA, NA, NA, NA, NA, NA), alpst = c(NA, NA, NA, NA, NA, NA,
NA), alprf = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), alpgr = c(NA,
NA, NA, NA, NA, NA, NA), kstty = c(NA, NA, NA, NA, NA, NA,
NA), kstnr = c(NA, NA, NA, NA, NA, NA, NA), nlfzv = c(0L,
0L, 0L, 0L, 0L, 0L, 0L), nlfmv = c(NA, NA, NA, NA, NA, NA,
NA), idhis = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), idvar = c(NA,
NA, NA, NA, NA, NA, NA), itsob = c(NA, NA, NA, NA, NA, NA,
NA), cufactor = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), funcid = c(NA,
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000022534c51ef0>)
Upvotes: 1
Views: 83
Reputation: 109
Looks like I have taken a cue from what Arun wrote and modified the code like so:
res2 <- map(stpo[,1:length(names(stpo))], test_unique)
Upvotes: 0
Reputation: 887971
The issue is that we are subsetting on a data.table
, rather than a data.frame
. Here, we need with = FALSE
(as mentioned in ?data.table
j - When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables.
stpo[,c(names(stpo))]
[1] "stlnr" "stlkn" "stpoz" "aennr" "vgknt" "idnrk" "pswrk" "meins" "menge" "fmeng" "ausch" "avoau" "netau" "erskz"
[15] "rekri" "rekrs" "nlfzt" "verti" "alpos" "ewahr" "ekgrp" "lifzt" "lifnr" "roms1" "roms2" "roms3" "romen" "rform"
[29] "upskz" "valkz" "matkl" "webaz" "clobk" "lgort" "kzkup" "dvnam" "dspst" "alpst" "alprf" "alpgr" "kstty" "kstnr"
[43] "nlfzv" "nlfmv" "idhis" "idvar" "itsob" "cufactor" "funcid"
Now, check the output of
stpo[,c(names(stpo)), with = FALSE]
stlnr stlkn stpoz aennr vgknt idnrk pswrk meins menge fmeng ausch avoau netau erskz rekri rekrs nlfzt verti alpos ewahr ekgrp lifzt lifnr roms1 roms2
1: 1 1 2 NA 0 test_1 NA EA 1 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
2: 2 1 2 NA 0 test_1 NA EA 14 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
3: 3 1 2 NA 0 test_2 NA EA 4 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
4: 3 2 4 NA 0 test_3 NA EA 4 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
5: 3 3 6 NA 0 test_3 NA EA 2 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
6: 3 4 8 NA 0 test_1 NA EA 2 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
7: 4 5 10 NA 0 test_2 NA EA 1 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
roms3 romen rform upskz valkz matkl webaz clobk lgort kzkup dvnam dspst alpst alprf alpgr kstty kstnr nlfzv nlfmv idhis idvar itsob cufactor funcid
1: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
2: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
3: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
4: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
5: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
6: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
7: 0 0 NA NA NA NA 0 NA 14 NA NA NA NA 0 NA NA NA 0 NA 0 NA
Also, there is no need to do any subsetting if the whole columns are used, i.e. simply do
purrr::map(stpo, test_unique)
-output
$stlnr
[1] 4
$stlkn
[1] 5
$stpoz
[1] 5
...
...
Regarding the use of
stpo[,1:length(names(stpo))]
It seems to be a bug or a hackish way of dealing things instead of the standard option
If we want to eliminate columns having a single value, use var
(assuming all numeric columns)
Filter(var, stpo)
stlnr stlkn stpoz menge
1: 1 1 2 1
2: 2 1 2 14
3: 3 1 2 4
4: 3 2 4 4
5: 3 3 6 2
6: 3 4 8 2
7: 4 5 10 1
Or change the function to return a logical output (it will also check for other type columns)
f1 <- function(x) length(unique(x)) > 1
Filter(f1, stpo)
-output
stlnr stlkn stpoz idnrk menge lgort
1: 1 1 2 test_1 1 NA
2: 2 1 2 test_1 14 NA
3: 3 1 2 test_2 4 NA
4: 3 2 4 test_3 4 NA
5: 3 3 6 test_3 2 NA
6: 3 4 8 test_1 2 NA
7: 4 5 10 test_2 1 14
Or use the data.table
way of subsetting the columns
stpo[, .SD, .SDcols = f1]
stlnr stlkn stpoz idnrk menge lgort
1: 1 1 2 test_1 1 NA
2: 2 1 2 test_1 14 NA
3: 3 1 2 test_2 4 NA
4: 3 2 4 test_3 4 NA
5: 3 3 6 test_3 2 NA
6: 3 4 8 test_1 2 NA
7: 4 5 10 test_2 1 14
Upvotes: 2