Reputation: 1
I'm in the process of cleaning data and have ended up with a lot of for loops. Since my data set has more than 6 million rows, this is a bit of a problem for me, but I'm not sure how to avoid them.
An example of my data set (called sentencing.df) would be something like:
Ethnicity PersonNumber
Caucasian 1
Caucasian 1
Unknown 1
Indian 2
Indian 2
I want to compare within the same person number - for example, I want to know whether the ethnicities for each person number are the same (and then to change the incorrect entries if they exist). My code uses for loops and looks something like this:
# vector of person numbers for those with ethnicity UNKNOWN
PersonListRace <- unique(sentencing.df[sentencing.df$ethnicity == "UNKNOWN",]$PersonNumber)
PersonListRace <- as.numeric(as.character(PersonListRace))
for (i in 1:100) {
  # all ethnicity entries recorded for that person
  race <- sentencing.df[sentencing.df$PersonNumber == PersonListRace[i],]$ethnicity
  # skips those who only have UNKNOWN or who have UNKNOWN plus multiple ethnicities
  if (length(unique(race)) != 2) {next}
  else {
    # the single known ethnicity, used to overwrite that person's entries
    label <- as.character(unique(race[which(race != "UNKNOWN")]))
    sentencing.df[sentencing.df$PersonNumber == PersonListRace[i],]$ethnicity <- label
  }
}
I then have similar things for all my other variables, and the for loops take far too long to run. I've looked at some of the other questions and answers on the site, but my main problem is that I can't find a way to compare only within the same person number across a different variable, without using a for loop.
Anything that would help me achieve my aim in a practical timeframe would be very much appreciated :)
Upvotes: 0
Views: 65
Reputation: 263342
Neither of my concerns was addressed in the comments, so I will just take the example as fully representative of the complexity of the problem (although my experience is that things are rarely so simple):
dat <- read.table(text="Ethnicity PersonNumber
Caucasian 1
Caucasian 1
Unknown 1
Indian 2
Indian 2", header=TRUE)
dat$TrueEth <- with( dat, ave(Ethnicity, PersonNumber,
                 FUN=function(perE){
                   unique( perE[perE != "Unknown"] ) } ) )
> dat
  Ethnicity PersonNumber   TrueEth
1 Caucasian            1 Caucasian
2 Caucasian            1 Caucasian
3   Unknown            1 Caucasian
4    Indian            2    Indian
5    Indian            2    Indian
The outstanding issues are what to do when a person has more than one non-Unknown value for Ethnicity and, if the answer is "majority rules", what to do when two or more non-Unknown values are tied.
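One cautious way to sidestep those issues, rather than resolve them, is to fill in a value only when a person has exactly one distinct non-Unknown ethnicity and to leave every other case untouched, which mirrors the `length(unique(race)) != 2` check in the question. The following is a minimal sketch under that assumption: `fill_unknown` is a hypothetical helper name, and persons 3 and 4 are made-up rows added to the example to exercise the edge cases.

```r
# Example data from the question, plus two invented persons:
# person 3 has only Unknown, person 4 has two conflicting known values.
dat <- read.table(text = "Ethnicity PersonNumber
Caucasian 1
Caucasian 1
Unknown 1
Indian 2
Indian 2
Unknown 3
Caucasian 4
Indian 4
Unknown 4", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper (not part of base R): receives all Ethnicity
# entries for one person and returns a vector of the same length.
fill_unknown <- function(perE) {
  known <- unique(perE[perE != "Unknown"])
  if (length(known) == 1) {
    rep(known, length(perE))  # exactly one known value: use it everywhere
  } else {
    perE                      # zero or several known values: leave as-is
  }
}

# ave() applies the helper once per PersonNumber group, with no explicit loop.
dat$TrueEth <- ave(dat$Ethnicity, dat$PersonNumber, FUN = fill_unknown)
```

Here persons 1 and 2 are filled as before, while persons 3 and 4 keep their original entries, so the ambiguous cases can be inspected separately instead of being silently overwritten. Returning a full-length vector from the helper also avoids relying on recycling inside `ave`.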
Upvotes: 1