kikulikov
kikulikov

Reputation: 2583

data.table conditional search

How to search in data.table by condition?

I have a table for, let's say, users which contains user ip + a table of ip ranges from http://dev.maxmind.com/geoip/legacy/csv/. I would like to get a country for each user. The problem is that in order to find a country you need to split the IP to tokens and apply some formula to get an integer value representing the address. Then you need to find a row where that value fits in range.

I managed to do that with data.frame's but it is very slow..

extract.country <- function(code) {
  geo[(geo['V3']<=code & code<=geo['V4'])][6]
}

There are 93100 ip addresses ranges and relatively same number of users. But it seems to take about

   user  system elapsed 
   17.109   1.144  20.649 

to process 100 users on my machine. Which will be about 5 hours to process all of them.

That is how my & geoIP datasets are look like:

> head(dat)
                   V2            V26
1 2014-03-01 14:06:59  86.183.184.19
2 2014-03-01 23:50:02  86.112.53.139
3 2014-03-01 15:07:02    5.69.149.65
4 2014-03-01 14:27:18 218.186.19.230
5 2014-03-01 13:08:31   86.0.151.153
6 2014-03-01 23:18:00    79.148.42.6
> 
> head(geo)
        V1         V2       V3       V4 V5        V6
1  1.0.0.0  1.0.0.255 16777216 16777471 AU Australia
2  1.0.1.0  1.0.3.255 16777472 16778239 CN     China
3  1.0.4.0  1.0.7.255 16778240 16779263 AU Australia
4  1.0.8.0 1.0.15.255 16779264 16781311 CN     China
5 1.0.16.0 1.0.31.255 16781312 16785407 JP     Japan
6 1.0.32.0 1.0.63.255 16785408 16793599 CN     China

Where for each IP I calculate an int value, like

[1] 1454880787 1450194315   88446273 3669627878 1442879385 1335110150

I wanted to improve the performance with a data.table as columns might be indexed and performance is really good for exact search. But I can't figure out how to make a conditional search with that.

Or, is there any other way to do that with R?

Upvotes: 1

Views: 651

Answers (2)

Colonel Beauvel
Colonel Beauvel

Reputation: 31161

This should be what you're looking for:

dt<-data.table(num=c(1,2,3,1,3,4,6,7,7,7,8,3), let=(rep(c("A","B","C"), each=4)))

then:

dt[num > 1 & let == "B"]

Upvotes: 1

mkrakhin
mkrakhin

Reputation: 3486

I'm new to R, so could be wrong. It seems that rows are sorted by values in V3 column (if not - sort them), so you can use fast binary search.

extract.country <- function(code) {
    geo[[findInterval(code,geo$V3),"V6"]]
}

Upvotes: 0

Related Questions