How to optimise these loops in R

Question

I'm in the process of cleaning data and have ended up with a lot of for loops. Since my data set has more than 6 million rows, the loops are far too slow, and I'm not sure how to avoid them.

An example of my data set (called sentencing.df) would be something like:

    Ethnicity     PersonNumber

    Caucasian     1
    Caucasian     1
    Unknown       1
    Indian        2
    Indian        2

I want to compare values within the same person number. For example, I want to know whether the ethnicities recorded for each person number are all the same, and to correct the inconsistent entries where they are not (here, replacing UNKNOWN with the person's one known ethnicity). My code uses for loops and looks something like this:

    # vector of person numbers that have at least one UNKNOWN ethnicity entry
    PersonListRace <- unique(sentencing.df[sentencing.df$ethnicity == "UNKNOWN", ]$PersonNumber)
    PersonListRace <- as.numeric(as.character(PersonListRace))

    for (i in 1:100) {
      # all ethnicity entries recorded for this person (not yet de-duplicated)
      race <- sentencing.df[sentencing.df$PersonNumber == PersonListRace[i], ]$ethnicity
      # skip those who only have UNKNOWN or who have UNKNOWN plus multiple ethnicities
      if (length(unique(race)) != 2) {
        next
      } else {
        # exactly one known ethnicity: apply it to all of this person's rows
        label <- as.character(unique(race[which(race != "UNKNOWN")]))
        sentencing.df[sentencing.df$PersonNumber == PersonListRace[i], ]$ethnicity <- label
      }
    }

I then have similar things for all my other variables, and the for loops take far too long to run. I've looked at some of the other questions and answers on the site, but my main problem is that I can't find a way to compare only within the same person number across a different variable, without using a for loop.

Anything that would help me achieve my aim in a practical timeframe would be very much appreciated :)
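For reference, the per-person rule described above (fill in UNKNOWN only when a person has exactly one known ethnicity) can be sketched as a grouped operation in base R instead of a loop over person numbers. This is a minimal sketch rather than a definitive solution: the toy data frame is made up for illustration, the column name `ethnicity` and the label "UNKNOWN" are taken from the loop above, and `fill_unknown()` is a hypothetical helper name, not a function from any package.

```r
# Toy data for illustration only; names follow the loop in the question.
sentencing.df <- data.frame(
  PersonNumber = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  ethnicity    = c("Caucasian", "Caucasian", "UNKNOWN",
                   "Indian", "Indian",
                   "Maori", "Asian", "UNKNOWN",
                   "UNKNOWN"),
  stringsAsFactors = FALSE
)

# Number of distinct ethnicity values per person, computed in one pass.
# A value above 1 means that person's entries disagree.
n_distinct <- tapply(sentencing.df$ethnicity, sentencing.df$PersonNumber,
                     function(v) length(unique(v)))

# Hypothetical helper: within each group, replace `unknown` with the single
# known value when exactly one distinct known value exists. Groups with no
# known value, or with several, are left unchanged. NA entries are ignored.
fill_unknown <- function(x, group, unknown = "UNKNOWN") {
  x <- as.character(x)  # also handles factor columns (result is character)
  ave(x, group, FUN = function(v) {
    is_unknown <- !is.na(v) & v == unknown
    known <- unique(v[!is.na(v) & !is_unknown])
    if (length(known) == 1L) v[is_unknown] <- known
    v
  })
}

sentencing.df$ethnicity <- fill_unknown(sentencing.df$ethnicity,
                                        sentencing.df$PersonNumber)
```

Because the helper takes the column and the grouping vector as arguments, the same call could be reused for other variables that follow the same rule. `ave()` splits the column by person once, rather than scanning the whole data frame for every person number, so it should scale much better than the loop. With millions of distinct person numbers it may still take a while, and a grouped update in a package such as data.table might be worth trying.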
