Reputation: 83
I am struggling to scrape data from GDELT.
http://data.gdeltproject.org/events/index.html
I aim to write code that automatically downloads, unzips, and merges the files for specific periods, but despite numerous attempts I have failed to do so.
Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.
I need your help.
Upvotes: 0
Views: 512
Reputation: 6628
The rvest package has the appropriate tools for this. We extract the href attributes from all link (<a href = ...>...</a>) nodes, filter down to those that end with ".CSV.zip", and build the full URLs. Now we can download each file, and readr::read_tsv() will unpack, read, and combine the files for us: it accepts a vector of paths and transparently decompresses single-file zip archives.
library(rvest)
library(tidyverse)

gdelt_index_url <- "http://data.gdeltproject.org/events"

gdelt_dom <- read_html(gdelt_index_url)

# Collect every link target and keep only the zipped daily CSV files
url_df <-
  gdelt_dom |>
  html_elements("a") |>
  html_attr("href") |>
  tibble() |>
  set_names("path") |>
  filter(str_detect(path, "\\.CSV\\.zip$")) |> # escape the dots; an unescaped "." matches any character
  mutate(url = file.path(gdelt_index_url, path)) |>
  slice(1:3) # for the purpose of demonstration we use only the first three files

# Download each archive; mode = "wb" prevents corrupted zip files on Windows
walk2(url_df$url,
      url_df$path,
      download.file,
      mode = "wb")

# read_tsv() decompresses each single-file zip, reads it, and row-binds the results
gdelt_event_data <-
  read_tsv(url_df$path, col_names = FALSE)
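Since you asked about specific periods: the daily files on that index page are named by date (YYYYMMDD.export.CSV.zip), so you can filter url_df on the parsed date before downloading (drop the slice(1:3) line first). A minimal sketch, assuming that naming convention and hypothetical period boundaries:

library(lubridate) # attached by default with tidyverse >= 2.0.0

# Hypothetical period boundaries; adjust to the window you need
start_date <- ymd("2023-01-01")
end_date   <- ymd("2023-01-31")

url_df_period <-
  url_df |>
  mutate(file_date = ymd(str_extract(path, "^\\d{8}"))) |> # parse the leading YYYYMMDD
  filter(!is.na(file_date),
         file_date >= start_date,
         file_date <= end_date)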
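As for gdeltr2 mishandling some variables: the raw files have no header row, so col_names = FALSE gives you generic X1, X2, ... names. GDELT publishes a one-line, tab-separated header file for the daily event exports in its documentation directory; a sketch, assuming this URL is still current (verify it against the GDELT documentation first):

# Assumed header URL; check the GDELT documentation before relying on it
header_url <- "http://data.gdeltproject.org/documentation/CSV.header.dailyupdates.txt"

gdelt_col_names <-
  str_split_1(read_lines(header_url)[[1]], "\t") # split the single header line on tabs

gdelt_event_data <-
  read_tsv(url_df$path, col_names = gdelt_col_names)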
Upvotes: 4