Reputation: 372
Let me start by saying that I have exhausted all the options I could come up with on my own, and researched each one to a dead end.
I have a typical MySQL forum database that includes a post table with about 880,000 rows. The post table contains a column for IP address, and my end goal is to create a bubble map of the world based on the geolocation of each post. Even better if I can separate them per month and create an animation of the post frequency around the world for the past 8 years.
Because this is a personal project and the accuracy of IP geolocation is not important, I have had to rule out the paid APIs that can batch-convert IPs to geolocations. I found various questions on Stack Overflow that linked to a site with free IP geolocation databases: https://dev.maxmind.com/geoip/geoip2/geolite2/
My initial plan was to load this database onto my forum server and use my experience with MySQL to create a new table with just: postid, date (as a unix timestamp), latitude, longitude, city, country. Then I would export this table to R and generate all the maps and charts I could ever want. However, the geolocation database is more than 3 million rows across two tables, and my dead forum is on a simple shared hosting plan that doesn't allow LOAD DATA. I tried all the solutions in these questions, with no luck:
How to import CSV file to MySQL table
access denied for load data infile in MySQL
LOAD DATA INFILE within PHPmyadmin
PHPMyAdmin saying: The used command is not allowed with this MySQL version
So my next idea was to export the relevant columns from my post table to .csv or .xml, then upload those to my account at iacademy3.oracle.com. However, I'm not experienced with Oracle, and the only method I knew was the Data Load/Unload UI in the Data Workshop. The 177MB XML file failed to upload with the following error:
ORA-31011: XML parsing failed
ORA-19202: Error occurred in XML processing
LPX-00222: error received from SAX callback function
Error loading XML.
Return to application.
The 34MB .csv file failed to upload on two attempts with this error:
Failure of Web Server bridge:
No backend server available for connection: timed out after 10 seconds or idempotent set to OFF or method not idempotent.
Now I'm out of ideas. On a post-by-post basis, it's a simple query to look at the post IP, compare it to the geolocation database, and get the latitude and longitude. But when working with millions of rows, I don't know how to get to my end result.
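To make the goal concrete, here is the kind of join I'm describing, sketched in R with data.table (untested; posts.csv is a hypothetical export of my post table, and the file/column names follow MaxMind's GeoLite2 CSV bundle and iptools as I understand them) — I just don't know whether something like this scales to millions of rows:
library(data.table)
library(iptools)

# Hypothetical export of postid, date (unix timestamp), ip from the post table
posts  <- fread("posts.csv")
# GeoLite2 CSV bundle: one row per CIDR block, with latitude/longitude columns
blocks <- fread("GeoLite2-City-Blocks-IPv4.csv")

# Convert each CIDR block to a numeric [min, max] range, and each post IP to a number
rng <- range_boundaries(blocks$network)
blocks[, `:=`(ip_min = rng$min_numeric, ip_max = rng$max_numeric)]
posts[, ip_num := ip_to_numeric(ip)]

# Non-equi join: find the block whose numeric range contains each post's IP
located <- blocks[posts,
                  on = .(ip_min <= ip_num, ip_max >= ip_num),
                  .(postid, date, latitude, longitude)]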
Any advice on new approaches or help with my dead ends would be greatly appreciated.
Upvotes: 1
Views: 391
Reputation: 78812
We'll generate some IP addresses, geolocate them and plot them:
library(iptools)
library(rgeolocate)
library(tidyverse)
Generate a million (way too uniformly distributed) random IPv4 addresses:
ips <- ip_random(1000000)
And, geolocate them:
system.time(
  rgeolocate::maxmind(
    ips, "~/Data/GeoLite2-City.mmdb", c("longitude", "latitude")
  ) -> xdf
)
##    user  system elapsed
##   5.016   0.131   5.217
5s for 1m IPv4s. 👍🏼
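The same single call should work on your real data — a sketch, assuming you export postid, date and ip to a posts.csv (the file and column names here are placeholders, not from your actual schema):
# Hypothetical export of the forum's post table: postid, date (unix ts), ip
posts <- readr::read_csv("posts.csv")

rgeolocate::maxmind(
  posts$ip, "~/Data/GeoLite2-City.mmdb",
  c("longitude", "latitude", "city_name", "country_name")
) -> geo

# maxmind() returns one row per input IP, so the columns line up with the posts
posts <- dplyr::bind_cols(posts, geo)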
Now, due to the uniformity, the bubbles will be stupidly small, so just for this example we'll snap them to a coarser 5-degree grid:
xdf %>%
  mutate(
    # snap each coordinate down to the nearest multiple of 5 degrees
    longitude = (longitude %/% 5) * 5,
    latitude  = (latitude %/% 5) * 5
  ) %>%
  count(longitude, latitude) -> pts
And, plot them:
ggplot(pts) +
  geom_point(
    aes(longitude, latitude, size = n),
    shape = 21, fill = "steelblue", color = "white", stroke = 0.25
  ) +
  ggalt::coord_proj("+proj=wintri") +
  ggthemes::theme_map() +
  theme(legend.justification = "center") +
  theme(legend.position = "bottom")
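And for the monthly animation mentioned in the question, you can bin by month too and let gganimate split the frames — a sketch assuming the posts frame from above (with its unix-timestamp date column), monthly flooring via lubridate, and the projection dropped to keep it short:
library(gganimate)
library(lubridate)

posts %>%
  mutate(
    # floor each unix timestamp to the first of its month
    month     = floor_date(as.POSIXct(date, origin = "1970-01-01"), "month"),
    longitude = (longitude %/% 5) * 5,
    latitude  = (latitude %/% 5) * 5
  ) %>%
  count(month, longitude, latitude) -> pts_by_month

anim <- ggplot(pts_by_month) +
  geom_point(
    aes(longitude, latitude, size = n),
    shape = 21, fill = "steelblue", color = "white", stroke = 0.25
  ) +
  ggthemes::theme_map() +
  transition_time(month) +
  labs(title = "Posts: {frame_time}")

animate(anim, nframes = 96)  # ~8 years of monthly frames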
You can see what I mean about "too uniform". But you have "real" IPv4s, so you should be good to go.
Consider using scale_size_area(), but, honestly, consider not plotting IPv4s on a geo-map at all. I do internet-scale research for a living and the accuracy claims leave much to be desired. I rarely go below country-level attribution for that reason (and we pay for "real" data).
Upvotes: 3