grrothman
grrothman

Reputation: 171

R data.table: set row equal to previous row within range

I am trying to clean up a dataset of crime data by giving it better categorical labels. A sample of the table looks like this:

d <- as.data.table(read.csv('[filepath]'))
print(d)

Classifications ucr_ncic_code
SOVEREIGNTY            NA
Treason                101
Treason Misprison      102
Espionage              103
Sabotage               104
Sedition               105
Selective Service      106
Sovereignty            199
MILITARY               NA
Military Desertion     201
Military               299 
IMMIGRATION            NA
Illegal Entry          301
False Citizenship      302
Smuggling Aliens       303
Immigration            399
HOMICIDE               NA
Homicide Family-Gun    901
Homicide Family-Weapon 902
Homicide Nonfam-Gun    903
<TRUNCATED>

As you can see, in the original dataset the broader categories of crime classifications are formatted as all-caps headers with an NA code (e.g SOVEREIGNTY NA). What I would like to do is reformat the data so that these headers are their own categorial column in the table.

Here is my initial solution, which I am almost sure is not the best approach, but produces the desired result:

d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)] 
#the substring is for some edge cases that I don't show here

setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]

prev.row <- NA
for (i in seq(1,d[,.N])) {
  current.row <- d$Category[i]  
  if (is.na(current.row) & !(is.na(prev.row))){
    d$Category[i] <- prev.row
  } 
  prev.row <- d$Category[i]
}

#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]

print(d)

Classifications   ucr_ncic_code   Category
Treason                 101       SOVEREIGNTY
Treason Misprison       102       SOVEREIGNTY
Espionage               103       SOVEREIGNTY
Sabotage                104       SOVEREIGNTY
Sedition                105       SOVEREIGNTY
Selective Service       106       SOVEREIGNTY
Sovereignty             199       SOVEREIGNTY
Military Desertion      201       MILITARY
Military                299       MILITARY
Illegal Entry           301       IMMIGRATION
False Citizenship       302       IMMIGRATION
Smuggling Aliens        303       IMMIGRATION
Immigration             399       IMMIGRATION
Homicide Family-Gun     901       HOMICIDE
Homicide Family-Weapon  902       HOMICIDE
Homicide Nonfam-Gun     903       HOMICIDE
<TRUNCATED>

What would be a better way to utilize the data.table package to make this formatting change? I'm guessing there's a better way to copy cells down than the for-loop that I designed.

Upvotes: 1

Views: 105

Answers (1)

MichaelChirico
MichaelChirico

Reputation: 34763

I would use rleid as follows:

DT[ , category := Classifications[1L],
    by = ceiling(rleid(is.na(ucr_ncic_code)) / 2)]

The key is that is.na(ucr_ncic_code) alternates between TRUE and FALSE within each block, so that every odd # rleid is a section head and every even section head is a specific crime. We can pair consecutive integers by rounding up when dividing by 2 (so that, e.g., 3 & 4 are both counted as 2).

To delete the section headers, just use DT <- DT[!is.na(ucr_ncic_code)].

A check on success (inspired by the comment here) is to run the following pair of tests:

DT[ , .(test1 = all(category == toupper(category)),
        test2 = any(Classifications == toupper(Classifications)))]

test1 makes sure all assigned categorys are indeed so (as enforced wit the rule that only categories are in all caps) and test2 makes sure there are no more objects in Classifications which are in all caps.

Upvotes: 1

Related Questions