Reputation: 171
I am trying to clean up a dataset of crime data by giving it better categorical labels. A sample of the table looks like this:
d <- as.data.table(read.csv('[filepath]'))
print(d)
Classifications ucr_ncic_code
SOVEREIGNTY NA
Treason 101
Treason Misprison 102
Espionage 103
Sabotage 104
Sedition 105
Selective Service 106
Sovereignty 199
MILITARY NA
Military Desertion 201
Military 299
IMMIGRATION NA
Illegal Entry 301
False Citizenship 302
Smuggling Aliens 303
Immigration 399
HOMICIDE NA
Homicide Family-Gun 901
Homicide Family-Weapon 902
Homicide Nonfam-Gun 903
<TRUNCATED>
As you can see, in the original dataset the broader categories of crime classifications are formatted as all-caps headers with an NA code (e.g SOVEREIGNTY NA
). What I would like to do is reformat the data so that these headers are their own categorial column in the table.
Here is my initial solution, which I am almost sure is not the best approach, but produces the desired result:
d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)]
#the substring is for some edge cases that I don't show here
setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]
prev.row <- NA
for (i in seq(1,d[,.N])) {
current.row <- d$Category[i]
if (is.na(current.row) & !(is.na(prev.row))){
d$Category[i] <- prev.row
}
prev.row <- d$Category[i]
}
#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]
print(d)
Classifications ucr_ncic_code Category
Treason 101 SOVEREIGNTY
Treason Misprison 102 SOVEREIGNTY
Espionage 103 SOVEREIGNTY
Sabotage 104 SOVEREIGNTY
Sedition 105 SOVEREIGNTY
Selective Service 106 SOVEREIGNTY
Sovereignty 199 SOVEREIGNTY
Military Desertion 201 MILITARY
Military 299 MILITARY
Illegal Entry 301 IMMIGRATION
False Citizenship 302 IMMIGRATION
Smuggling Aliens 303 IMMIGRATION
Immigration 399 IMMIGRATION
Homicide Family-Gun 901 HOMICIDE
Homicide Family-Weapon 902 HOMICIDE
Homicide Nonfam-Gun 903 HOMICIDE
<TRUNCATED>
What would be a better way to utilize the data.table package to make this formatting change? I'm guessing there's a better way to copy cells down than the for-loop that I designed.
Upvotes: 1
Views: 105
Reputation: 34763
I would use rleid
as follows:
DT[ , category := Classifications[1L],
by = ceiling(rleid(is.na(ucr_ncic_code)) / 2)]
The key is that is.na(ucr_ncic_code)
alternates between TRUE
and FALSE
within each block, so that every odd # rleid
is a section head and every even section head is a specific crime. We can pair consecutive integers by rounding up when dividing by 2 (so that, e.g., 3 & 4 are both counted as 2).
To delete the section headers, just use DT <- DT[!is.na(ucr_ncic_code)]
.
A check on success (inspired by the comment here) is to run the following pair of tests:
DT[ , .(test1 = all(category == toupper(category)),
test2 = any(Classifications == toupper(Classifications)))]
test1
makes sure all assigned category
s are indeed so (as enforced wit the rule that only categories are in all caps) and test2
makes sure there are no more objects in Classifications
which are in all caps.
Upvotes: 1