Jochem
Jochem

Reputation: 3386

Creating a unique integer on the basis of a string

I have a larger dataset (data.table with approx 9m rows) with a column that I would like to use to aggregate values (min and max etc). The column is a combination of various other columns and has a string based format, like the one below:

string <- "318XXXX | VNSGN | BIER"

To gain some speed in performing tasks, I would like to recode this to a unique integer. Another application that I use on a regular basis to deal with data has a build-in function that transforms a string as the one above in a integer (e.g. 73823). I was wondering whether there is a similar function in R? The idea is that a particular string will always result in the same integer; this will allow it to be used in merging data.tables etc.

Here a little example of the data.table column that I would like to encode in simple integer values:

sample <- c("318XXXX | VNSGN | BIER", "462XXXX | TZZZH | 9905", "462XXXX | TZZZH | 9905", 
"462XXXX | TZZZH | 9905", "511XXXX | FAWOR | 336H", "511XXXX | FAWOR | 336H", 
"652XXXX | XXXXR | T136", "652XXXX | XXXXR | T136", "672XXXX | BQQSZ | 7777", 
"672XXXX | BQQSZ | 7777")

I am hoping to encode the strings into an additional column to the table like the one below; note that the same strings result in the same numbers.

String                      Number
318XXXX | VNSGN | BIER      19872
462XXXX | TZZZH | 9905      78392
462XXXX | TZZZH | 9905      78392
462XXXX | TZZZH | 9905      78392
511XXXX | FAWOR | 336H      23053
511XXXX | FAWOR | 336H      23053
652XXXX | XXXXR | T136      95832
652XXXX | XXXXR | T136      95832
672XXXX | BQQSZ | 7777      71829   
672XXXX | BQQSZ | 7777      71829

Upvotes: 3

Views: 246

Answers (1)

G. Grothendieck
G. Grothendieck

Reputation: 269852

The data.table package will create indexes for you without making you handle them explicitly so it would be less work than the approach in the question. See the setkey function in data.table.

Also the sqldf package can use the SQL create index statement as per Examples 4h and 4i on the sqldf home page as can just about any database package.

Upvotes: 6

Related Questions