Reputation: 1209
I am trying to scrape the content of this Wikipedia page using the rvest library in R:
(https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019)
I want to extract the 4 tables that contain data on Bollywood film releases in 2019 (January–March, April–June, July–September, October–December).
Already done
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")
# Then I match on the word "Opening" and get the same 4 tables as on the Wikipedia page,
# but I am struggling to combine them into one data frame and store it
tbls[grep("Opening",tbls,ignore.case = T)]
df <- html_table(tbls[grep("Opening",tbls,ignore.case = T)],fill = T)
I understand that because this returned multiple tables, I am missing a subscript somewhere, but I'm not sure where. Help!
Upvotes: 2
Views: 793
Reputation: 23574
Here is one way for you, though I believe there are better ways to handle your case. When you use the rvest package, you can use SelectorGadget. You will see that there are 15 tables in the link. First, you want to scrape all the tables and create a list object. Then, you want to subset the list using column information. The tables that you want to scrape have Opening as a column name, so I used a logical check to test whether each list element has a column with that name, which yields the four tables that you want.
library(tidyverse)
library(htmltab)
res <- map(.x = 1:15,
           .f = function(mynum) {
             htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019",
                     which = mynum, rm_nodata_cols = FALSE)
           })
out <- Filter(function(x) any(names(x) %in% "Opening"), res)
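To get from the filtered list to the single data frame the question asks for, one option is dplyr::bind_rows(). A minimal offline sketch (the toy tibbles here stand in for the real quarterly tables in out; on the real data you would call bind_rows(out, .id = "quarter")):

```r
library(dplyr)
library(tibble)

# Toy stand-ins for two of the quarterly tables
q1 <- tibble(Opening = "JAN", Title = "Film A")
q2 <- tibble(Opening = "APR", Title = "Film B", Director = "X")

# bind_rows() matches columns by name and fills gaps with NA, unlike
# rbind(), which errors when the column sets differ between quarters
combined <- bind_rows(list(q1 = q1, q2 = q2), .id = "quarter")
```

The .id column records which quarterly table each row came from, which is useful once the four tables are stacked.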
Upvotes: 3
Reputation: 1563
For complicated HTML tables, I recommend the htmltab package:
library(purrr)
library(htmltab)
url <- "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2019"
# Tables 4–7 on the page are the four quarterly film tables
tbls <- map2(url, 4:7, htmltab)
tbls <- do.call(rbind, tbls)
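An rvest-only alternative, which stays with the library the question started from and filters by column name rather than hard-coded table positions (which can shift as the page is edited). This is a sketch on an inline HTML snippet via rvest's minimal_html(); on the real page you would use read_html(url) instead:

```r
library(rvest)
library(dplyr)

# Two toy tables: only the first has an "Opening" column
page <- minimal_html('
  <table><tr><th>Opening</th><th>Title</th></tr>
         <tr><td>JAN</td><td>Film A</td></tr></table>
  <table><tr><th>Rank</th><th>Film</th></tr>
         <tr><td>1</td><td>Other</td></tr></table>')

# Parse every table into a list of data frames
tables <- page %>% html_elements("table") %>% html_table()

# Keep only tables that have an "Opening" column, then row-bind them
wanted <- Filter(function(x) "Opening" %in% names(x), tables)
films  <- bind_rows(wanted)
```

Filtering on names(x) is the same idea as the Filter() step in the accepted approach, just applied to html_table() output instead of htmltab().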
Upvotes: 1