Vaibhav Singh

Reputation: 1209

Scrape multiple tables from Wikipedia in R

I am trying to scrape the content of this Wikipedia page using the rvest library in R:

(https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019)

I want to extract the 4 tables that contain data on the release of Bollywood films in 2019 (January–March, April–June, July–September, October–December).

What I have done so far:

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")

# Then I match on the word "Opening" and get the 4 tables shown on the Wikipedia page. However, I am struggling to combine them into one data frame and store it.

tbls[grep("Opening",tbls,ignore.case = T)]

This gives an error:

df <- html_table(tbls[grep("Opening",tbls,ignore.case = T)],fill = T)

I understand this is because it returned multiple tables. I think I am missing a subscript somewhere, but I am not sure where. Help!

Upvotes: 2

Views: 793

Answers (2)

jazzurro

Reputation: 23574

Here is one way, though I believe there are better ways to handle your case. When you use the rvest package, you can use SelectorGadget. There are 15 tables on the linked page. First, scrape all of the tables into a list. Then subset the list using the column names. The tables that you want have Opening as a column name, so I used a logical check to test whether each list element has a column with that name, which gives the four tables that you want.

library(tidyverse)
library(htmltab)

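# Scrape each of the 15 tables on the page into a list.
map(.x = 1:15,
    .f = function(mynum) {htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019",
                                  which = mynum, rm_nodata_cols = FALSE)}) -> res

# Keep only the tables that have a column named "Opening".
Filter(function(x) any(names(x) %in% "Opening"), res) -> out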
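
The question also asks for a single data frame, while `out` above is still a list of four tables. A minimal sketch of that last step, assuming dplyr is available: the toy list `res_demo`, its column names and values, and the names `out_demo` and `films` are all made up here so the example runs without touching the network. With the real data you would pass `out` to `bind_rows()` instead.

```r
library(dplyr)

# Toy stand-in for the scraped list `res`; columns and values are invented.
res_demo <- list(
  data.frame(Opening = "JAN", Title = "Film A", stringsAsFactors = FALSE),
  data.frame(Rank = "1", Gross = "100", stringsAsFactors = FALSE),
  data.frame(Opening = "APR", Title = "Film B", Genre = "Drama",
             stringsAsFactors = FALSE)
)

# Keep only the tables that have a column named "Opening".
out_demo <- Filter(function(x) "Opening" %in% names(x), res_demo)

# Stack the remaining tables into one data frame. bind_rows() matches
# columns by name and fills columns missing from a table with NA.
# .id adds a column recording which list element each row came from.
films <- bind_rows(out_demo, .id = "table_id")
```

`bind_rows()` is used here rather than `rbind()` because it tolerates tables whose columns do not match exactly, which may or may not matter for the actual Wikipedia tables.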

Upvotes: 3

Mislav

Reputation: 1563

For complicated HTML tables, I recommend the htmltab package:

library(purrr)
library(htmltab)

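url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"
# Read tables 4 to 7 on the page, one htmltab() call per table index.
tbls <- map2(url, 4:7, htmltab)
# Stack the four tables into one data frame.
tbls <- do.call(rbind, tbls)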
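
If you would rather stay with rvest and not hard-code the table positions, the same idea can be sketched as follows. This is only an illustration under stated assumptions: the inline HTML is a made-up stand-in for the Wikipedia page so the example runs offline, and `page`, `all_tables`, `has_opening` and `films` are hypothetical names. For the real page you would call `read_html()` on the URL instead.

```r
library(rvest)

# Made-up stand-in page with three small tables (not the real Wikipedia HTML).
page <- read_html('
<html><body>
<table><tr><th>Opening</th><th>Title</th></tr>
       <tr><td>JAN</td><td>Film A</td></tr></table>
<table><tr><th>Rank</th><th>Gross</th></tr>
       <tr><td>1</td><td>100</td></tr></table>
<table><tr><th>Opening</th><th>Title</th></tr>
       <tr><td>APR</td><td>Film B</td></tr></table>
</body></html>')

# Parse every <table> node; html_table() on a node set returns a list.
all_tables <- html_table(html_nodes(page, "table"))

# Flag the tables that have a column named "Opening".
has_opening <- vapply(all_tables,
                      function(x) "Opening" %in% names(x),
                      logical(1))

# Stack the matching tables; rbind() needs them to share the same columns.
films <- do.call(rbind, all_tables[has_opening])
```

Filtering on the parsed column names, rather than using `grep()` on the raw nodes, only keeps tables where Opening is really a header and not just a word somewhere in the table.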

Upvotes: 1
