Wilcar

Reputation: 2513

Parsing the French Gallica digital library API with R

Gallica is the French national digital library. I am working on the 1916 issues of the French newspaper "Ouest Eclair".

Gallica has an API, and I can get a list of the ids (ark) and dayOfYear of every issue from 1916 with this URL:

https://gallica.bnf.fr/services/Issues?ark=ark:/12148/cb41193663x/date&date=1916
# ark is the identifier of the newspaper

Example:

    <issue ark="bpt6k567105k" dayOfYear="1">01 janvier 1916</issue>

I am trying to parse the output into a data frame using the XML package, without success:

library(XML)
data <- xmlParse("https://gallica.bnf.fr/services/Issues?ark=ark:/12148/cb41193663x/date&date=1916")

xml_data <- xmlToList(data)

R gives me this error:

Error: XML content does not seem to be XML: 'https://gallica.bnf.fr/services/Issues?ark=ark:/12148/cb41193663x/date&date=1916'

Upvotes: 1

Views: 86

Answers (1)

mdag02

Reputation: 1175

The XML looks like this:

<issues compile_time="0:00:14.417" date="1916" list_type="issue" parent_ark="ark:/12148/cb41193663x/date">
    <issue ark="bpt6k567105k" dayOfYear="1">01 janvier 1916</issue>
    <issue ark="bpt6k567106z" dayOfYear="2">02 janvier 1916</issue>
    ...
</issues>

We can extract both the attributes (ark and dayOfYear) and the content (the literal date) by selecting all the issue nodes with xml_find_all(".//issue"), then building one row per node and binding the rows into a data frame with map_df:

library(httr)
library(xml2)
library(tidyverse)

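# Fetch the list of 1916 issues from the Gallica API
r <- GET("https://gallica.bnf.fr/services/Issues?ark=ark:/12148/cb41193663x/date&date=1916")

r %>%
  content() %>%                  # parse the response body as XML
  xml_find_all(".//issue") %>%   # select every <issue> node
  # one row per node: its attributes plus its text content
  map_df(~ c(as.list(xml_attrs(.x)), date_parution = xml_text(.x)))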

Result:

# A tibble: 366 x 3
   ark          dayOfYear date_parution  
   <chr>        <chr>     <chr>          
 1 bpt6k567105k 1         01 janvier 1916
 2 bpt6k567106z 2         02 janvier 1916
 3 bpt6k567107b 3         03 janvier 1916
 4 bpt6k567108q 4         04 janvier 1916
 5 bpt6k5671093 5         05 janvier 1916
 6 bpt6k5671101 6         06 janvier 1916
 7 bpt6k567111d 7         07 janvier 1916
 8 bpt6k567112s 8         08 janvier 1916
 9 bpt6k5671135 9         09 janvier 1916
10 bpt6k567114j 10        10 janvier 1916
# ... with 356 more rows
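
The same extraction can be tried offline on a small snippet, which helps when checking the logic without calling the API. The following is only a sketch: it uses xml2 and base R, the sample XML is cut down to the two issue nodes shown above, and issues_to_df is a hypothetical helper name, not part of any package. It also assumes that dayOfYear is 1-based, as the sample suggests, in order to derive a proper Date column.

```r
library(xml2)

# Two <issue> nodes copied from the API response shown above
xml_txt <- '<issues date="1916" list_type="issue" parent_ark="ark:/12148/cb41193663x/date">
  <issue ark="bpt6k567105k" dayOfYear="1">01 janvier 1916</issue>
  <issue ark="bpt6k567106z" dayOfYear="2">02 janvier 1916</issue>
</issues>'

# Hypothetical helper: turn a parsed <issues> document into a data frame
issues_to_df <- function(doc, year) {
  nodes <- xml_find_all(doc, ".//issue")
  day <- as.integer(xml_attr(nodes, "dayOfYear"))
  data.frame(
    ark = xml_attr(nodes, "ark"),
    dayOfYear = day,
    date_parution = xml_text(nodes),
    # Assumption: dayOfYear is 1-based, so day 1 maps to 1 January of `year`
    date = as.Date(day - 1, origin = paste0(year, "-01-01")),
    stringsAsFactors = FALSE
  )
}

issues <- issues_to_df(read_xml(xml_txt), 1916)
```

xml_attr and xml_text are vectorised over a node set, so no explicit loop or map is needed here. To run it on the live data, the parsed document returned by content(r) could be passed in place of read_xml(xml_txt).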

If you are behind a proxy, use:

GET("https://gallica.bnf.fr/services/Issues?ark=ark:/12148/cb41193663x/date&date=1916",
    use_proxy("your.proxy.address",
              port = 8080,
              username = "user",
              password = "password"))

Upvotes: 1
