Zoltan Navy

Reputation: 33

duplicate 'row.names' are not allowed -- still killing me

I have the following import:

d5_17cou <- 
  read.table("c_all_d5_imp.dat",
   header=TRUE, sep="\t", na.strings="", dec=",", row.names=1, comment.char="",  strip.white=TRUE)

"row.names" are set to 1 in order to set the 1st column as row name.

I want to perform multiple correspondence analysis (MCA) using PCAmixdata.

I convert the required variables to factors and pick out the row weights and the qualitative variables:

d5_17cou <- within(d5_17cou, {
  a025r <- as.factor(a025r)
  a034r <- as.factor(a034r)
  a038r <- as.factor(a038r)
  a040r <- as.factor(a040r)
  a041r <- as.factor(a041r)
  a042r <- as.factor(a042r)
  c001r <- as.factor(c001r)
  c024r <- as.factor(c024r)
  c037r <- as.factor(c037r)
  charity <- as.factor(charity)
  clz.outgr4 <- as.factor(clz.outgr4)
  d019r <- as.factor(d019r)
  d023r <- as.factor(d023r)
  e014r <- as.factor(e014r)
  e018r <- as.factor(e018r)
  e035r <- as.factor(e035r)
  e114r <- as.factor(e114r)
  e143r <- as.factor(e143r)
  e146r <- as.factor(e146r)
  e190rr <- as.factor(e190rr)
  f022r <- as.factor(f022r)
  f028r <- as.factor(f028r)
  f051r <- as.factor(f051r)
  f064r <- as.factor(f064r)
  f066r <- as.factor(f066r)
  f121r <- as.factor(f121r)
  helpef <- as.factor(helpef)
  jpay <- as.factor(jpay)
  prices1 <- as.factor(prices1)
  psub.all <- as.factor(psub.all)
})
weight.row <- d5_17cou[,c(4)]
X.quali <- d5_17cou[,c(7:36)]

Then comes the MCA command line:

mca <- PCAmix(X.quanti=NULL,X.quali,ndim=5,weight.col=NULL,weight.row,graph=FALSE)

Followed by the error message: duplicate 'row.names' are not allowed.

This is strange, given that the code worked a couple of years ago on the exact same data. Not this time.

I have browsed most of the archive here for the "duplicate row.names" error and tried many of the solutions, but I still get the same error. So "try looking up this or that thread" advice probably will not help; I need something more specific.

Even more bizarre, after adding the

row.names=1

argument to read.table this afternoon, it worked OK, but not in the evening when I returned to the task using the exact same script.

Data in question attached.

Data file [Google Drive]

Thanks in advance.

Upvotes: 2

Views: 422

Answers (1)

Uwe

Reputation: 42592

I believe the problem is caused by the row names being numbers, e.g., 199990901000, which are larger than the greatest integer value, .Machine$integer.max, which is 2147483647. Although the row names of a data.frame are of type character, this might cause a problem in later processing steps.
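To illustrate the size issue, here is a small sketch using only base R. The variable name `id` is just for illustration; whether this is what actually trips up the later steps remains a guess.

```r
# Largest value R's 32-bit integer type can hold
.Machine$integer.max

# An identifier of the size seen in the data (illustrative value)
id <- "199990901000"

# Coercing it to integer does not fit and gives NA (normally with a warning)
id_int <- suppressWarnings(as.integer(id))
is.na(id_int)

# As a number it is well above the integer range
as.numeric(id) > .Machine$integer.max
```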

Therefore, I suggest treating the first column as a regular data column and not as row.names.
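In base R, that suggestion could look roughly like the sketch below. It assumes a tab-separated file with the ID in the first column. The three inline rows are made-up stand-ins for the real file, and one ID is duplicated on purpose to show how to check for uniqueness once the IDs are an ordinary column.

```r
# Made-up stand-in for the real file; the last two rows share an ID on purpose
txt <- "S007\tyear\ta025r\n199905600001\t1999\t1\n199905600002\t1999\t3\n199905600002\t1999\t1\n"

# No row.names argument, so the ID column stays a regular column;
# the named colClasses entry keeps the long IDs as character
dat <- read.table(text = txt, header = TRUE, sep = "\t", dec = ",",
                  colClasses = c(S007 = "character"),
                  comment.char = "", strip.white = TRUE)

# Index of the first repeated ID, or 0 if all IDs are unique
first_dup <- anyDuplicated(dat$S007)
first_dup

# All rows that share an ID with another row
dup_rows <- dat[duplicated(dat$S007) | duplicated(dat$S007, fromLast = TRUE), ]
dup_rows
```

With the real file, replace `text = txt` by the file name. If `anyDuplicated()` returns 0, the IDs are unique and the error must come from somewhere else.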

The code below worked for me to read the file and to coerce many columns to factor:

library(data.table)
url <- sprintf("https://docs.google.com/uc?id=%s&export=download", 
               "1NwcvwwaPLWaSmKOuQiVrWAK4iKn9f10S")
d5_17cou  <- fread(url, dec = ",", colClasses = list(character = 1L))
cols <- names(d5_17cou)[8:37]
d5_17cou[, (cols) := lapply(.SD, as.factor), .SDcols = cols]
str(d5_17cou)
Classes ‘data.table’ and 'data.frame':    22431 obs. of  39 variables:
 $ S007      : chr  "199905600001" "199905600002" "199905600003" "199905600004" ...
 $ S003A     : int  56 56 56 56 56 56 56 56 56 56 ...
 $ cou.year  : int  561999 561999 561999 561999 561999 561999 561999 561999 561999 561999 ...
 $ year      : int  1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
 $ s017ay    : num  0.692 1.051 1.051 0.752 0.752 ...
 $ uitem     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ item      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ a025r     : Factor w/ 2 levels "1","3": 2 2 2 2 1 1 1 2 1 2 ...
 $ a034r     : Factor w/ 2 levels "1","3": 1 2 1 1 1 2 1 2 1 1 ...
 $ a038r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 2 1 2 1 ...
 $ a040r     : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ a041r     : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ a042r     : Factor w/ 2 levels "1","3": 2 1 2 2 1 1 1 1 1 1 ...
 $ c001r     : Factor w/ 2 levels "1","3": 1 1 2 1 1 1 1 1 1 1 ...
 $ c024r     : Factor w/ 2 levels "1","3": 2 2 2 2 1 2 2 1 2 2 ...
 $ c037r     : Factor w/ 2 levels "1","3": 1 1 1 1 2 2 1 2 2 1 ...
 $ charity   : Factor w/ 2 levels "1","3": 1 1 2 1 1 1 1 2 2 2 ...
 $ clz.outgr4: Factor w/ 2 levels "1","3": 2 1 1 1 1 1 1 2 1 1 ...
 $ d019r     : Factor w/ 2 levels "1","3": 2 2 2 2 2 2 2 2 2 2 ...
 $ d023r     : Factor w/ 2 levels "1","3": 2 2 1 1 2 2 2 2 2 2 ...
 $ e014r     : Factor w/ 2 levels "1","3": 1 2 1 1 2 2 2 2 2 2 ...
 $ e018r     : Factor w/ 2 levels "1","3": 2 2 2 2 1 1 1 2 2 1 ...
 $ e035r     : Factor w/ 2 levels "1","3": 2 2 1 2 2 1 1 2 2 2 ...
 $ e114r     : Factor w/ 2 levels "1","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ e143r     : Factor w/ 2 levels "1","3": 2 1 1 2 2 1 1 1 1 2 ...
 $ e146r     : Factor w/ 2 levels "1","3": 1 1 2 1 1 2 1 1 2 1 ...
 $ e190rr    : Factor w/ 2 levels "1","3": 2 1 2 2 2 1 2 2 2 2 ...
 $ f022r     : Factor w/ 2 levels "1","3": 1 1 1 2 1 1 1 1 1 1 ...
 $ f028r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 2 1 1 1 ...
 $ f051r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 1 2 1 1 ...
 $ f064r     : Factor w/ 2 levels "1","3": 1 2 2 2 1 1 2 2 2 1 ...
 $ f066r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 2 2 2 2 ...
 $ f121r     : Factor w/ 2 levels "1","3": 1 2 2 1 1 2 2 1 2 2 ...
 $ helpef    : Factor w/ 2 levels "1","3": 1 2 2 2 2 2 2 2 2 2 ...
 $ jpay      : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ prices1   : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ psub.all  : Factor w/ 2 levels "1","3": 2 1 1 1 1 2 1 2 2 1 ...
 $ oriend    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ dupl      : int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr>

Note that the first column S007 is explicitly read in as a character column (otherwise fread() uses integer64) and is now part of the dataset. Consequently, the position of every subsequent column is shifted by one.
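Because of that shift, the positions 4 and 7:36 used in the question would become 5 and 8:37. Selecting by name avoids the bookkeeping. The sketch below uses a tiny made-up data.table with a few of the column names from the str() output above; taking s017ay as the weight column is an inference from the question's position 4, not something the question confirms.

```r
library(data.table)

# Tiny made-up stand-in with some of the real column names
dt <- data.table(
  S007   = c("199905600001", "199905600002"),
  s017ay = c(0.692, 1.051),
  a025r  = c(3L, 1L),
  a034r  = c(1L, 3L)
)

# Qualitative variables selected by name instead of by position
cols <- c("a025r", "a034r")
dt[, (cols) := lapply(.SD, as.factor), .SDcols = cols]

# Plain data.frame of factors, and the weights as a plain vector
X.quali    <- as.data.frame(dt[, ..cols])
weight.row <- dt[["s017ay"]]
str(X.quali)
```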

BTW, fread() is much faster than read.table().

Upvotes: 1
