Reputation: 2583
How to search in data.table by condition?
I have a table for, let's say, users which contains user ip + a table of ip ranges from http://dev.maxmind.com/geoip/legacy/csv/. I would like to get a country for each user. The problem is that in order to find a country you need to split the IP to tokens and apply some formula to get an integer value representing the address. Then you need to find a row where that value fits in range.
I managed to do that with data.frame's but it is very slow..
extract.country <- function(code) {
geo[(geo['V3']<=code & code<=geo['V4'])][6]
}
There are 93100 ip addresses ranges and relatively same number of users. But it seems to take about
user system elapsed
17.109 1.144 20.649
to process 100 users on my machine. Which will be about 5 hours to process all of them.
That is how my & geoIP datasets are look like:
> head(dat)
V2 V26
1 2014-03-01 14:06:59 86.183.184.19
2 2014-03-01 23:50:02 86.112.53.139
3 2014-03-01 15:07:02 5.69.149.65
4 2014-03-01 14:27:18 218.186.19.230
5 2014-03-01 13:08:31 86.0.151.153
6 2014-03-01 23:18:00 79.148.42.6
>
> head(geo)
V1 V2 V3 V4 V5 V6
1 1.0.0.0 1.0.0.255 16777216 16777471 AU Australia
2 1.0.1.0 1.0.3.255 16777472 16778239 CN China
3 1.0.4.0 1.0.7.255 16778240 16779263 AU Australia
4 1.0.8.0 1.0.15.255 16779264 16781311 CN China
5 1.0.16.0 1.0.31.255 16781312 16785407 JP Japan
6 1.0.32.0 1.0.63.255 16785408 16793599 CN China
Where for each IP I calculate an int value, like
[1] 1454880787 1450194315 88446273 3669627878 1442879385 1335110150
I wanted to improve the performance with a data.table as columns might be indexed and performance is really good for exact search. But I can't figure out how to make a conditional search with that.
Or, is there any other way to do that with R?
Upvotes: 1
Views: 651
Reputation: 31161
This should be what you're looking for:
dt<-data.table(num=c(1,2,3,1,3,4,6,7,7,7,8,3), let=(rep(c("A","B","C"), each=4)))
then:
dt[num > 1 & let == "B"]
Upvotes: 1
Reputation: 3486
I'm new to R, so could be wrong. It seems that rows are sorted by values in V3 column (if not - sort them), so you can use fast binary search.
extract.country <- function(code) {
geo[[findInterval(code,geo$V3),"V6"]]
}
Upvotes: 0