Reputation: 83
I am struggling to scrape data from GDELT.
http://data.gdeltproject.org/events/index.html
I aim to write code that automatically downloads, unzips, and merges the files for specific periods, but despite numerous attempts I have failed to do so.
Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.
I need your help.
Upvotes: 0
Views: 512
Reputation: 6628
The rvest package has the appropriate tools for this. We extract the href attributes from all link (<a href = ...>...</a>) nodes, filter down to those that end with ".CSV.zip", and build the full URLs. Now we can download each file, and readr::read_tsv() will unpack, read, and combine the files for us: it accepts a vector of paths and transparently decompresses single-file zip archives.
library(rvest)
library(tidyverse)

gdelt_index_url <- "http://data.gdeltproject.org/events"

gdelt_dom <- read_html(gdelt_index_url)

# Collect every link target and keep only the zipped daily CSV files
url_df <-
  gdelt_dom |>
  html_elements("a") |>
  html_attr("href") |>
  tibble() |>
  set_names("path") |>
  filter(str_detect(path, "\\.CSV\\.zip$")) |> # escape the dots; an unescaped "." matches any character
  mutate(url = file.path(gdelt_index_url, path)) |>
  slice(1:3) # for the purpose of demonstration we use only the first three files

# Download each archive; mode = "wb" prevents corrupted zip files on Windows
walk2(url_df$url,
      url_df$path,
      download.file,
      mode = "wb")

# read_tsv() decompresses each single-file zip, reads it, and row-binds the results
gdelt_event_data <-
  read_tsv(url_df$path, col_names = FALSE)
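Since you asked about specific periods: the daily files on that index page are named by date (YYYYMMDD.export.CSV.zip), so you can filter url_df on the parsed date before downloading (drop the slice(1:3) line first). A minimal sketch, assuming that naming convention and hypothetical period boundaries:

library(lubridate) # attached by default with tidyverse >= 2.0.0

# Hypothetical period boundaries; adjust to the window you need
start_date <- ymd("2023-01-01")
end_date   <- ymd("2023-01-31")

url_df_period <-
  url_df |>
  mutate(file_date = ymd(str_extract(path, "^\\d{8}"))) |> # parse the leading YYYYMMDD
  filter(!is.na(file_date),
         file_date >= start_date,
         file_date <= end_date)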
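As for gdeltr2 mishandling some variables: the raw files have no header row, so col_names = FALSE gives you generic X1, X2, ... names. GDELT publishes a one-line, tab-separated header file for the daily event exports in its documentation directory; a sketch, assuming this URL is still current (verify it against the GDELT documentation first):

# Assumed header URL; check the GDELT documentation before relying on it
header_url <- "http://data.gdeltproject.org/documentation/CSV.header.dailyupdates.txt"

gdelt_col_names <-
  str_split_1(read_lines(header_url)[[1]], "\t") # split the single header line on tabs

gdelt_event_data <-
  read_tsv(url_df$path, col_names = gdelt_col_names)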
Upvotes: 4