Reputation: 512
I have a data.table with a column that lists the harmonized tariff codes for the goods that are being shipped. There are some input issues because sometimes a row may have repeated numbers "7601.00; 7601.00" and sometimes it might have different numbers, "7601.00; 8800.00". I have not decided what to do when I have differing entries, but the first thing I want to do is get rid of the duplicates. So I wrote a custom user defined function:
unique_hscodes <- function(hs_input){
new <- strsplit(hs_input, split = ";") # Delimiter ;
new <- lapply(new, str_replace_all, " ", "")
if (length(unique(unlist(new))) == 1) { # Unique HS code
return(unique(unlist(new)))
}
else {
new <- names(sort(table(unlist(new)),decreasing=TRUE)[1]) # Most frequent
return(new)
}
}
When I do, DT[, hs_code := unique_hscodes(hscode)]
it returns me a data table with a column hs_code with the same number. But when I do DT[, hs_code := unique_hscodes(hscode), by =1:nrow(DT)]
, it is done properly.
Can someone please explain what is going on here?
Upvotes: 1
Views: 67
Reputation: 263301
Your code returns multiple items from a single item input after the string split. When you run it with by=1:nrow(DT) only a single row is examined at once. That problem doesn't arise when only a single row is presented.
DT <- data.table(hscode=c("7601.00; 7601.00" , "7601.00; 8800.00"))
DT
#-----
hscode
1: 7601.00; 7601.00
2: 7601.00; 8800.00
#--
DT[ , table( unlist( strsplit(hscode, split="; "))) ]
#7601.00 8800.00
# 3 1
DT[ , table( unlist( strsplit(hscode, split="; "))) , by=1:nrow(DT)]
#---------
nrow V1
1: 1 2
2: 2 1
3: 2 1
I tried @Jaap's code with the simple example but it only split the column in two:
> DT[, hs_code := sapply(hscode, unique_hscodes)]
> DT
hscode hs_code
1: 7601.00; 7601.00 7601.00
2: 7601.00; 8800.00 7601.00
Upvotes: 1