SteveS

Reputation: 4040

How to query Wikipedia using the WikipediR package or API?

I want to get the names of all meals from Wikipedia:

https://en.wikipedia.org/wiki/Lists_of_prepared_foods

How can I query it in R?

There is a query function, but there is no good example of how to use it.

Upvotes: 0

Views: 440

Answers (1)

s__

Reputation: 9485

I know the WikipediR package can help here (a sketch follows the output below), but rvest could also be useful:

library(rvest)

URL <- "https://en.wikipedia.org/wiki/Lists_of_prepared_foods"

# grab the text of every link in the "List of ..." bullet lists
temp <- URL %>%
  read_html() %>%
  html_nodes("#mw-content-text h3+ ul a , .column-width a") %>%
  html_text()
temp

[1] "List of almond dishes"                     "List of ancient dishes"                    "List of avocado dishes"                   
  [4] "List of bacon substitutes"                 "List of baked goods"                       "List of breakfast beverages"              
  [7] "List of breakfast cereals"                 "List of breakfast foods"                   "List of cabbage dishes"                   
 [10] "List of cakes"                             "List of candies"                           "List of carrot dishes" ... (trunc. output)

EDIT

To scrape the names on each page, I advise you to loop over the pages, rebuilding the vector temp from above but extracting the links (href attributes) instead of the link text:

temp <- URL %>%
  read_html() %>%
  html_nodes("#mw-content-text h3+ ul a , .column-width a") %>%
  html_attr('href')
temp
  [1] "/wiki/List_of_almond_dishes"                     "/wiki/List_of_ancient_dishes"                   
  [3] "/wiki/List_of_avocado_dishes"                    "/wiki/List_of_bacon_substitutes"  ... trunc. output)

Now you create an empty list to populate with the foods for each link:

# an empty list to collect the dishes from each page
listed <- list()

for (i in temp) {
  # build the full URL from the relative links scraped above
  url <- paste0("https://en.wikipedia.org", i)

  # each component of the list holds the dish names extracted from one page
  listed[[i]] <- url %>%
    read_html() %>%
    # be sure to check the nodes on each page; these selectors seem to work
    html_nodes("h2~ ul li > a:nth-child(1) , a a") %>%
    html_text()

  # very important: pause 15 seconds after each page so you don't flood
  # the site with requests in a short span of time
  Sys.sleep(15)
}

As a result:

$`/wiki/List_of_almond_dishes`
 [1] "Ajoblanco"                "Almond butter"            "Alpen (food)"             "Amandine (culinary term)" "Amlu"                    
 [6] "Bakewell tart"            "Bear claw (pastry)"       "Bethmännchen"             "Biscuit Tortoni"          "Blancmange"              
[11] "Christmas cake"           "Churchkhela"              "Ciarduna"                 "Colomba di Pasqua"        "Comfit"                  
[16] "Coucougnette"             "Crème de Noyaux"          "Cruncheroos"              "Dacquoise"                "Daim bar"                
[21] "Dariole"                  "Esterházy torte"   ... (trunc. output)

$`/wiki/List_of_ancient_dishes`
  [1] "Anfu ham"           "Babaofan"           "Bread"              "Flatbread"          "Focaccia"           "Mantou"            
  [7] "Chili pepper"       "Chutney"            "Congee"             "Curry"              "Doubanjiang"        "Fish sauce"        
 [13] "Forcemeat"          "Garum"              "Ham"                "Harissa"            "Jeok"               "Jusselle"          
 [19] "Liquamen"           "Maccu"              "Misu karu"          "Moretum"            "Nian gao"           "Noodle"  ... (trunc. output)

Upvotes: 2
