Reputation: 33
I have the following import:
d5_17cou <-
  read.table("c_all_d5_imp.dat",
             header=TRUE, sep="\t", na.strings="", dec=",", row.names=1,
             comment.char="", strip.white=TRUE)
row.names is set to 1 so that the first column is used for the row names.
I want to perform multiple correspondence analysis (MCA) using PCAmixdata.
I convert the required variables to factors and set up the remaining inputs:
d5_17cou <- within(d5_17cou, {
a025r <- as.factor(a025r)
a034r <- as.factor(a034r)
a038r <- as.factor(a038r)
a040r <- as.factor(a040r)
a041r <- as.factor(a041r)
a042r <- as.factor(a042r)
c001r <- as.factor(c001r)
c024r <- as.factor(c024r)
c037r <- as.factor(c037r)
charity <- as.factor(charity)
clz.outgr4 <- as.factor(clz.outgr4)
d019r <- as.factor(d019r)
d023r <- as.factor(d023r)
e014r <- as.factor(e014r)
e018r <- as.factor(e018r)
e035r <- as.factor(e035r)
e114r <- as.factor(e114r)
e143r <- as.factor(e143r)
e146r <- as.factor(e146r)
e190rr <- as.factor(e190rr)
f022r <- as.factor(f022r)
f028r <- as.factor(f028r)
f051r <- as.factor(f051r)
f064r <- as.factor(f064r)
f066r <- as.factor(f066r)
f121r <- as.factor(f121r)
helpef <- as.factor(helpef)
jpay <- as.factor(jpay)
prices1 <- as.factor(prices1)
psub.all <- as.factor(psub.all)
})
weight.row <- d5_17cou[,c(4)]
X.quali <- d5_17cou[,c(7:36)]
Then comes the MCA command line:
mca <- PCAmix(X.quanti=NULL,X.quali,ndim=5,weight.col=NULL,weight.row,graph=FALSE)
Followed by the error message: duplicate 'row.names' are not allowed.
This is strange, given that the code worked a couple of years ago on exactly the same data. Not this time.
I have browsed most of the archive here for the "duplicate row.names" error and tried many of the solutions, but I still get the same error. In other words, "try looking up this or that thread" advice probably will not help; what I need is something more specific.
Even more bizarre, after adding the row.names=1 argument to read.table this afternoon, it worked, but not in the evening when I returned to the task using the exact same script.
The data in question are attached: data file [Google Drive].
Thanks in advance.
Upvotes: 2
Views: 422
Reputation: 42592
I believe the problem is caused by the fact that the row names are numbers, e.g., 199990901000, which are larger than the greatest integer value, .Machine$integer.max, which is 2147483647. Although row names of a data.frame are of type character, this might cause a problem in later processing steps.
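To make the size argument concrete, here is a minimal sketch in base R. It only demonstrates that an ID of this magnitude does not fit into R's integer type; it does not show where, or whether, PCAmix performs such a conversion, which remains an assumption.

```r
# The largest value R's 32-bit integer type can hold
.Machine$integer.max

# An ID of the magnitude mentioned above, kept as text
id <- "199990901000"

# Coercing it to integer overflows; R returns NA (with a warning)
as_int <- suppressWarnings(as.integer(id))
is.na(as_int)

# As a double the value is still representable and exceeds the integer limit
as.numeric(id) > .Machine$integer.max
```

If distinct IDs were ever coerced this way, they would all collapse to NA, which is one conceivable route to a duplicate-names complaint.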
Therefore, I suggest treating the first column as a regular data column and not as row.names.
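For anyone who prefers to stay with read.table(), the same idea might look like the sketch below. The tiny file written to a temporary path is a made-up stand-in for the real data (only the column names S007, year and s017ay are borrowed from it), and one ID is repeated on purpose to show how genuine duplicates could be checked for.

```r
# Hypothetical miniature of the tab-separated file, with one repeated ID
tmp <- tempfile(fileext = ".dat")
writeLines(c(
  "S007\tyear\ts017ay",
  "199905600001\t1999\t0,692",
  "199905600002\t1999\t1,051",
  "199905600002\t1999\t0,752"
), tmp)

# No row.names argument: the ID stays an ordinary column, read as character
d <- read.table(tmp, header = TRUE, sep = "\t", dec = ",",
                colClasses = c(S007 = "character"))

# List every ID that occurs more than once
dup_ids <- unique(d$S007[duplicated(d$S007)])
dup_ids
```

An empty result on the real file would suggest that the IDs themselves are unique and that the error arises somewhere after the import.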
The code below worked for me to read the file and to coerce many columns to factor:
library(data.table)
url <- sprintf("https://docs.google.com/uc?id=%s&export=download",
"1NwcvwwaPLWaSmKOuQiVrWAK4iKn9f10S")
d5_17cou <- fread(url, dec = ",", colClasses = list(character = 1L))
cols <- names(d5_17cou)[8:37]
d5_17cou[, (cols) := lapply(.SD, as.factor), .SDcols = cols]
str(d5_17cou)
Classes ‘data.table’ and 'data.frame': 22431 obs. of 39 variables:
 $ S007      : chr "199905600001" "199905600002" "199905600003" "199905600004" ...
 $ S003A     : int 56 56 56 56 56 56 56 56 56 56 ...
 $ cou.year  : int 561999 561999 561999 561999 561999 561999 561999 561999 561999 561999 ...
 $ year      : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
 $ s017ay    : num 0.692 1.051 1.051 0.752 0.752 ...
 $ uitem     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ item      : int 1 2 3 4 5 6 7 8 9 10 ...
 $ a025r     : Factor w/ 2 levels "1","3": 2 2 2 2 1 1 1 2 1 2 ...
 $ a034r     : Factor w/ 2 levels "1","3": 1 2 1 1 1 2 1 2 1 1 ...
 $ a038r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 2 1 2 1 ...
 $ a040r     : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ a041r     : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ a042r     : Factor w/ 2 levels "1","3": 2 1 2 2 1 1 1 1 1 1 ...
 $ c001r     : Factor w/ 2 levels "1","3": 1 1 2 1 1 1 1 1 1 1 ...
 $ c024r     : Factor w/ 2 levels "1","3": 2 2 2 2 1 2 2 1 2 2 ...
 $ c037r     : Factor w/ 2 levels "1","3": 1 1 1 1 2 2 1 2 2 1 ...
 $ charity   : Factor w/ 2 levels "1","3": 1 1 2 1 1 1 1 2 2 2 ...
 $ clz.outgr4: Factor w/ 2 levels "1","3": 2 1 1 1 1 1 1 2 1 1 ...
 $ d019r     : Factor w/ 2 levels "1","3": 2 2 2 2 2 2 2 2 2 2 ...
 $ d023r     : Factor w/ 2 levels "1","3": 2 2 1 1 2 2 2 2 2 2 ...
 $ e014r     : Factor w/ 2 levels "1","3": 1 2 1 1 2 2 2 2 2 2 ...
 $ e018r     : Factor w/ 2 levels "1","3": 2 2 2 2 1 1 1 2 2 1 ...
 $ e035r     : Factor w/ 2 levels "1","3": 2 2 1 2 2 1 1 2 2 2 ...
 $ e114r     : Factor w/ 2 levels "1","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ e143r     : Factor w/ 2 levels "1","3": 2 1 1 2 2 1 1 1 1 2 ...
 $ e146r     : Factor w/ 2 levels "1","3": 1 1 2 1 1 2 1 1 2 1 ...
 $ e190rr    : Factor w/ 2 levels "1","3": 2 1 2 2 2 1 2 2 2 2 ...
 $ f022r     : Factor w/ 2 levels "1","3": 1 1 1 2 1 1 1 1 1 1 ...
 $ f028r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 2 1 1 1 ...
 $ f051r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 1 2 1 1 ...
 $ f064r     : Factor w/ 2 levels "1","3": 1 2 2 2 1 1 2 2 2 1 ...
 $ f066r     : Factor w/ 2 levels "1","3": 1 2 2 2 2 1 2 2 2 2 ...
 $ f121r     : Factor w/ 2 levels "1","3": 1 2 2 1 1 2 2 1 2 2 ...
 $ helpef    : Factor w/ 2 levels "1","3": 1 2 2 2 2 2 2 2 2 2 ...
 $ jpay      : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ prices1   : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ psub.all  : Factor w/ 2 levels "1","3": 2 1 1 1 1 2 1 2 2 1 ...
 $ oriend    : int 1 1 1 1 1 1 1 1 1 1 ...
 $ dupl      : int 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr>
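Note that the first column, S007, is explicitly read in as a character column (otherwise fread() uses integer64) and is now part of the dataset. Consequently, the position of every subsequent column shifts by one compared with the read.table() import that used row.names=1; for example, the factor columns are now 8:37 instead of 7:36.
BTW, fread() is much faster than read.table().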
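Because positional indices such as d5_17cou[,c(4)] and d5_17cou[,c(7:36)] silently point at different columns once the ID column is part of the data, selecting by name may be the safer route. The sketch below uses a made-up two-row data.frame with a few of the real column names; treating s017ay as the weight column is an assumption based on its position in the str() output.

```r
# Hypothetical stand-in with the same leading columns as the real data
d <- data.frame(S007 = c("199905600001", "199905600002"),
                S003A = c(56L, 56L),
                cou.year = c(561999L, 561999L),
                year = c(1999L, 1999L),
                s017ay = c(0.692, 1.051),
                a025r = factor(c("3", "1")),
                a034r = factor(c("1", "3")),
                stringsAsFactors = FALSE)

# Select by name, so an extra ID column cannot shift the result
weight.row <- d[["s017ay"]]   # assumed weight column

# Pick up every factor column, wherever it sits
quali_cols <- names(d)[vapply(d, is.factor, logical(1))]
X.quali <- d[, quali_cols, drop = FALSE]
```

With a data.table instead of a data.frame, the last line would need d[, quali_cols, with = FALSE], since data.table evaluates the second argument differently.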
Upvotes: 1