Reputation: 49
Problem: While scraping a web page (imdb.com, a page with film details), an error message is displayed. When I checked the details, I noticed that no data is available for some of the entries. How can I figure out during scraping which line has no data, and how can I fill it with NA?
Manual investigation: I checked the web page manually, and the problem is with rank number 1097, where only the film genre is available and there is no runtime.
Tried: adding an if that appends a 0 when fewer than 250 runtimes are found (see the commented-out line in the code below), but the 0 is added to the last line, not to the title that is missing the value.
Code:
#install packages
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open browser (in my case Firefox)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#start offsets for pagination (5 pages of 250 results each)
ile<-seq(from=1, by=250, length.out = 5)
#create empty frame
filmy_df=data.frame()
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#loop reading the data from each page
for (j in ile){
#set link for browser
newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
startNumberURL<-paste0(newURL,j)
#open link
remDr$navigate(startNumberURL)
#read webpage code
strona_int<-read_html(startNumberURL)
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#read rank
rank_data<-html_nodes(strona_int,'.text-primary')
#convert text
rank_data<-html_text(rank_data)
#remove the comma for thousands
rank_data<-gsub(",","",rank_data)
#convert numeric
rank_data<-as.numeric(rank_data)
#read link for each movie
link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
#release year
year<-html_nodes(strona_int,'.lister-item-year')
#convert text
year<-html_text(year)
#remove non numeric
year<-gsub("\\D","",year)
#set factor
year<-as.factor(year)
#read title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert text
title_data<-html_text(title_data)
#title_data<-as.character(title_data)
#read description
description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
#convert text
description_data<-html_text(description_data)
#remove '\n'
description_data<-gsub("\n","",description_data)
#remove space
description_data<-trimws(description_data,"l")
#read runtime
runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
#convert text
runtime_data <- html_text(runtime_data)
#remove min
runtime_data<-gsub(" min","",runtime_data)
length_runtime_data<-length(runtime_data)
#if (length_runtime_data<250){ runtime_data<-append(runtime_data,list(0))}
runtime_data<-as.numeric(runtime_data)
#temp_df
filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
#add to df
filmy_df<-rbind(filmy_df,filmy_df_temp)
}
#close browser
remDr$close()
#stop RSelenium
rD[["server"]]$stop()
Error message displayed:
"Error in data.frame(Rank = rank_data, Title = title_data, Release.Year = year, : arguments imply differing number of rows: 250, 249"
runtime_data contains only 249 entries instead of 250, so the runtime appears to be missing for the last line instead of for the line where it is really missing.
Update: I have found something interesting which may help to solve the problem. Please check the pictures: Anima (the source of the error) and Knocked Up (the next entry).
When we compare the pictures, we can notice that Anima, which is causing the problem with runtime_data, does not have an html node containing the runtime at all.
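A minimal way to reproduce this behaviour outside IMDb (toy markup; the .film class is made up for the example):
library(rvest)
#toy page: three films, the middle one has no runtime node
strona_test <- read_html('<div class="film"><span class="runtime">90 min</span></div>
<div class="film"></div>
<div class="film"><span class="runtime">120 min</span></div>')
html_text(html_nodes(strona_test, '.film .runtime'))
#returns "90 min" "120 min" - only two values for three films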
So the question is: is there a way to check whether an html node exists or not? If yes, how can I do this?
Upvotes: 0
Views: 490
Reputation: 17090
You wouldn't run into this problem if you had structured your program a little differently. In general, it is better to split your program into logically separate chunks that are more or less independent of each other instead of doing everything at once. That makes debugging much easier.
First, scrape the data and store it in a list – use lapply or something similar for that.
newURL <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
pages <- lapply(ile, function(j) {
#set link for browser
startNumberURL<-paste0(newURL,j)
#open link
remDr$navigate(startNumberURL)
#read webpage code
read_html(startNumberURL)
})
Then you have your data scraped and you can take all the time you need to analyse and filter it, without having to start the reading process again. For example, define the function as follows:
parsuj_strone <- function(strona_int) {
#read rank
rank_data<-html_nodes(strona_int,'.text-primary')
#convert text
rank_data<-html_text(rank_data)
#remove the comma for thousands
rank_data<-gsub(",","",rank_data)
#convert numeric
rank_data<-as.numeric(rank_data)
#read link for each movie
link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
#release year
year<-html_nodes(strona_int,'.lister-item-year')
#convert text
year<-html_text(year)
#remove non numeric
year<-gsub("\\D","",year)
#set factor
year<-as.factor(year)
#read title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert text
title_data<-html_text(title_data)
#read description
description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
#convert text
description_data<-html_text(description_data)
#remove '\n'
description_data<-gsub("\n","",description_data)
#remove space
description_data<-trimws(description_data,"l")
#read runtime
runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
#convert text
runtime_data <- html_text(runtime_data)
#remove min
runtime_data<-gsub(" min","",runtime_data)
length_runtime_data<-length(runtime_data)
runtime_data<-as.numeric(runtime_data)
#temp_df
filmy_df_temp<- data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
return(filmy_df_temp)
}
Now, apply the function to each scraped web page:
pages_parsed <- lapply(pages, parsuj_strone)
And finally, put them together into a data frame:
pages_df <- Reduce(rbind, pages_parsed)
Reduce won't mind an occasional NULL. Powodzenia! (Good luck!)
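A toy illustration of the NULL point (made-up one-column data frames):
czesci <- list(data.frame(x = 1:2), NULL, data.frame(x = 3))
Reduce(rbind, czesci)
#rbind(df, NULL) returns df unchanged, so the NULL element is skipped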
EDIT: OK, so the problem is in the parsuj_strone() function. First, replace the final line of that function with this:
filmy_df_temp <- list(Rank=rank_data,
                      Title=title_data,
                      Release.Year=year,
                      Link=link,
                      Description=description_data,
                      Runtime=runtime_data)
return(filmy_df_temp)
Run
pages_parsed <- lapply(pages, parsuj_strone)
Then, identify which of the 5 web pages returned problematic entries:
sapply(pages_parsed, function(x) sapply(x, length))
This should give you a 6 x 5 matrix (one row per field, one column per page). Finally, pick an element which has only 249 entries; how does it look? Without knowing your parser well, this should at least give you a hint where the problems may be.
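If you prefer to locate the short entries programmatically instead of by eye, a sketch along these lines should work (assuming every complete page yields 250 values per field):
dlugosci <- sapply(pages_parsed, function(x) sapply(x, length))
#rows are the six fields, columns the five pages
which(dlugosci != 250, arr.ind = TRUE)  #field/page positions that came up short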
Upvotes: 1
Reputation: 49
On Stack Overflow you will find everything; you just need to know how and where to search. Here is the link to the answer to my problem: Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?
In short: instead of html_nodes, html_node (without the s) should be used.
#read runtime
runtime_data <- html_node(szczegoly_filmu,'.text-muted .runtime')
#convert to text
runtime_data <- html_text(runtime_data)
#remove " min"
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
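For context: szczegoly_filmu is not defined in the snippet above; it stands for the nodeset of per-film blocks. It could be obtained with something like the line below (the .lister-item-content selector is an assumption based on IMDb's list markup at the time). Applied to a nodeset, html_node() returns exactly one result per element, so a film without a runtime comes back as NA and runtime_data keeps all 250 entries.
#assumed selector: one content block per film on the results page
szczegoly_filmu <- html_nodes(strona_int, '.lister-item-content')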
Upvotes: 0