giuliot

Reputation: 163

How to extract the codes from a factor

First of all, I would like to apologise in advance if my question is not very clear. I'm totally new to R, so my terminology may be off.

We get an SPSS file from an external company that contains survey data. We have an R script that extracts the data and writes it to a CSV file. This works fine.

The second part of the script builds an INI-style file listing all the possible answers. As an example, for AGE we would have something like:

[ AGE ]
1 = Under 13
2 = 13 - 15
3 = 15 - 25
4 = 25+

The CSV file will have one of 1, 2, 3 or 4 for each line. Until recently all possible answers were numbered starting with 1, but now some of them start from 0. Therefore we would like to have something like:

[ AGE ]
0 = Under 13
1 = 13 - 15
2 = 15 - 25
3 = 25+

The following is the current R code that we use. I know where it goes wrong, but I don't know how to correct it.

data<-read.spss(inputFile, to.data.frame=TRUE);
fileOut<- file(valuesExportFile, "w");
for (name in names(data)) {
  cat("[", name,"]\n", file=fileOut);
  variableValues<-levels(data[[name]]);
  numberOfValues<-nlevels(data[[name]]);
  if (numberOfValues > 0) {
     for (i in 1:numberOfValues) {
         cat(i, '= "', variableValues[i], '"', "\n", file=fileOut);
     }
  }
};
close(fileOut);
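
The reason the loop index cannot stand in for the SPSS code can be seen without an SPSS file. Below is a minimal sketch with made-up codes and labels (`codes`, `labels`, `answers` and `positions` are hypothetical names, not part of the script above). Once a variable has become a factor, R keeps only 1-based positions into `levels()`, so an original code of 0 is no longer visible.

```r
# Hypothetical SPSS-style codes and labels, made up for illustration.
codes  <- c(0, 1, 2, 3)
labels <- c("Under 13", "13 - 15", "15 - 25", "25+")

# Three made-up answers, turned into a factor with the labels as levels.
answers <- factor(c(0, 3, 1), levels = codes, labels = labels)

# A factor stores positions within levels(), and those always start at 1,
# so the original code 0 cannot be read back from it.
positions <- as.integer(answers)
print(levels(answers))
print(positions)
```

This is why counting from 1 to `nlevels()` happens to work when the codes start at 1 and run without gaps, and fails otherwise.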

I have spent a day and a half googling and trying various approaches. I did find a Perl script, spssread.pl, that extracts the data the way we want it, but for some reason all the label names come out in uppercase, which is not acceptable because they are case-sensitive. I will keep looking at that script, but in the meantime I would like to see if there is a solution in R, since that is what we already use and it would be nice to have everything in one script.

So, any suggestions?

Upvotes: 3

Views: 1411

Answers (1)

giuliot

Reputation: 163

Thanks to Brian Diggs I was able to explore another approach and found a solution, although not a perfect one.

My solution was to read the data with use.value.labels=FALSE, then unclass each variable and use its value.labels attribute. I think showing the code is clearer than trying to explain it.

data<-read.spss(inputFile, to.data.frame=TRUE, use.value.labels=FALSE);
fileOut<- file(valuesExportFile, "w");
for (name in names(data)) {
    cat("[", name,"]\n", file=fileOut);
    variables<-attr(unclass(data[[name]]), "value.labels");
    for (label in names(variables)) {
        cat(variables[[label]], '= "', label, '"', "\n", file=fileOut);
    }
};
close(fileOut);
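
To make the loop above easier to follow, here is a sketch using a hand-built stand-in for one column. The vector `age` and its contents are made up. The only thing taken from the code above is the shape of the `value.labels` attribute: a named vector whose names are the labels and whose values are the codes.

```r
# Hand-built stand-in for one imported column (hypothetical data).
age <- c(0, 2, 1, 2)
attr(age, "value.labels") <- c("25+" = 3, "15 - 25" = 2,
                               "13 - 15" = 1, "Under 13" = 0)

value_labels <- attr(age, "value.labels")

# names() gives the labels; the values are the original codes.
for (label in names(value_labels)) {
  cat(value_labels[[label]], "=", label, "\n")
}
```

The entries are printed in whatever order the attribute stores them, which is why the real output is not necessarily in ascending order.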

The result

[ AGE ]
8 = " 65+ "
7 = " 55 to 64 "
6 = " 45 to 54 "
5 = " 35 to 44 "
4 = " 25 to 34 "
3 = " 21 to 24 "
2 = " 16 to 20 "
1 = " 13 to 15 "
0 = " Under 13 "

although workable, is not ideal. Does anyone know how I could sort them to get the following?

[ AGE ]
0 = " Under 13 "
1 = " 13 to 15 "
2 = " 16 to 20 "
3 = " 21 to 24 "
4 = " 25 to 34 "
5 = " 35 to 44 "
6 = " 45 to 54 "
7 = " 55 to 64 "
8 = " 65+ "
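
A side note on the spaces inside the quotes in the listings above: `cat()` separates its arguments with a single space by default, which is where that padding comes from. If the padding is unwanted, something along these lines might help. This is only a sketch, and `code`, `label`, `padded` and `tight` are names made up for the example.

```r
code  <- 0
label <- "Under 13"

# Default behaviour: cat() puts a space between every argument.
padded <- capture.output(cat(code, '= "', label, '"', "\n"))

# sprintf() builds the line explicitly, so no extra spaces are inserted.
tight <- sprintf('%s = "%s"', code, label)

print(padded)
print(tight)
```

In the script itself, the `sprintf()` result could be passed to `writeLines(..., con = fileOut)` in place of the `cat()` call.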

EDIT: 04/05/12

After some more help from Brian Diggs (see the comments), the final solution is

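library(foreign)  # provides read.spss

# Keep the numeric codes instead of converting labelled variables to factors.
data <- read.spss(inputFile, to.data.frame=TRUE, use.value.labels=FALSE)
fileOut <- file(valuesExportFile, "w")
for (name in names(data)) {
    cat("[", name, "]\n", file=fileOut)
    # Named vector: names are the labels, values are the SPSS codes.
    variables <- attr(unclass(data[[name]]), "value.labels")
    # Sort by numeric code so the entries are written in ascending order.
    variables <- variables[order(as.numeric(variables))]
    for (label in names(variables)) {
        cat(variables[[label]], '= "', label, '"', "\n", file=fileOut)
    }
}
close(fileOut)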
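
The sorting line can be tried in isolation. This is a sketch on a made-up `value.labels` vector. Storing the codes as character strings is an assumption made here to show why `as.numeric()` is useful; the real attribute may already be numeric, which `str()` will show.

```r
# Made-up value labels; codes stored as strings on purpose (assumption).
value_labels <- c("65+" = "10", "13 to 15" = "2", "Under 13" = "0")

# Ordering the strings directly compares them as text.
as_text <- value_labels[order(value_labels)]

# Converting first orders them by numeric value, as in the script above.
as_numbers <- value_labels[order(as.numeric(value_labels))]

print(names(as_text))
print(names(as_numbers))
```

With text comparison, "10" typically sorts before "2", so the numeric conversion matters as soon as a variable has ten or more codes.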

Upvotes: 2
