Reputation: 49
Problem: While scraping a web page (imdb.com, a page with film details), an error message is displayed. When I checked the details, I noticed that no data is available for some of the entries. How can I figure out during scraping which line has no data, and how can I fill it with NA?
Manual investigation: I checked the web page manually, and the problem is with rank number 1097, where only the film genre is available and there is no runtime.
Tried: adding an if that appends a 0 when fewer than 250 runtimes are found (see the commented-out line in the code below), but the 0 is added to the last line, not to the title that is missing the value.
Code:
#install packages
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open browser (in my case Firefox)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#start offsets for pagination (5 pages of 250 results each)
ile<-seq(from=1, by=250, length.out = 5)
#create empty frame
filmy_df=data.frame()
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#loop reading the data from each page
for (j in ile){
#set link for browser
newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
startNumberURL<-paste0(newURL,j)
#open link
remDr$navigate(startNumberURL)
#read webpage code
strona_int<-read_html(startNumberURL)
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#read rank
rank_data<-html_nodes(strona_int,'.text-primary')
#convert text
rank_data<-html_text(rank_data)
#remove the comma for thousands
rank_data<-gsub(",","",rank_data)
#convert numeric
rank_data<-as.numeric(rank_data)
#read link for each movie
link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
#release year
year<-html_nodes(strona_int,'.lister-item-year')
#convert text
year<-html_text(year)
#remove non numeric
year<-gsub("\\D","",year)
#set factor
year<-as.factor(year)
#read title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert text
title_data<-html_text(title_data)
#title_data<-as.character(title_data)
#read description
description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
#convert text
description_data<-html_text(description_data)
#remove '\n'
description_data<-gsub("\n","",description_data)
#remove space
description_data<-trimws(description_data,"l")
#read runtime
runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
#convert text
runtime_data <- html_text(runtime_data)
#remove min
runtime_data<-gsub(" min","",runtime_data)
length_runtime_data<-length(runtime_data)
#if (length_runtime_data<250){ runtime_data<-append(runtime_data,list(0))}
runtime_data<-as.numeric(runtime_data)
#temp_df
filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
#add to df
filmy_df<-rbind(filmy_df,filmy_df_temp)
}
#close browser
remDr$close()
#stop RSelenium
rD[["server"]]$stop()
Error message displayed:
"Error in data.frame(Rank = rank_data, Title = title_data, Release.Year = year, : arguments imply differing number of rows: 250, 249"
runtime_data contains only 249 entries instead of 250, so the runtime appears to be missing for the last line instead of for the line where it is really missing.
Update: I have found something interesting which may help to solve the problem. Please check the pictures: Anima (the source of the error) and Knocked Up (the next entry).
When we compare the pictures, we can notice that Anima, which is causing the problem with runtime_data, does not have an html node containing the runtime at all.
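A minimal way to reproduce this behaviour outside IMDb (toy markup; the .film class is made up for the example):
library(rvest)
#toy page: three films, the middle one has no runtime node
strona_test <- read_html('<div class="film"><span class="runtime">90 min</span></div>
<div class="film"></div>
<div class="film"><span class="runtime">120 min</span></div>')
html_text(html_nodes(strona_test, '.film .runtime'))
#returns "90 min" "120 min" - only two values for three films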
So the question is: is there a way to check whether an html node exists or not? If yes, how can I do this?
Upvotes: 0
Views: 490
Reputation: 17090
You wouldn't run into this problem if you had structured your program a little differently. In general, it is better to split your program into logically separate chunks that are more or less independent of each other instead of doing everything at once. That makes debugging much easier.
First, scrape the data and store it in a list – use lapply or something similar for that.
newURL <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
pages <- lapply(ile, function(j) {
#set link for browser
startNumberURL<-paste0(newURL,j)
#open link
remDr$navigate(startNumberURL)
#read webpage code
read_html(startNumberURL)
})
Then you have your data scraped and you can take all the time you need to analyse and filter it, without having to start the reading process again. For example, define the function as follows:
parsuj_strone <- function(strona_int) {
#read rank
rank_data<-html_nodes(strona_int,'.text-primary')
#convert text
rank_data<-html_text(rank_data)
#remove the comma for thousands
rank_data<-gsub(",","",rank_data)
#convert numeric
rank_data<-as.numeric(rank_data)
#read link for each movie
link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
#release year
year<-html_nodes(strona_int,'.lister-item-year')
#convert text
year<-html_text(year)
#remove non numeric
year<-gsub("\\D","",year)
#set factor
year<-as.factor(year)
#read title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert text
title_data<-html_text(title_data)
#read description
description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
#convert text
description_data<-html_text(description_data)
#remove '\n'
description_data<-gsub("\n","",description_data)
#remove space
description_data<-trimws(description_data,"l")
#read runtime
runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
#convert text
runtime_data <- html_text(runtime_data)
#remove min
runtime_data<-gsub(" min","",runtime_data)
length_runtime_data<-length(runtime_data)
runtime_data<-as.numeric(runtime_data)
#temp_df
filmy_df_temp<- data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
return(filmy_df_temp)
}
Now, apply the function to each scraped web page:
pages_parsed <- lapply(pages, parsuj_strone)
And finally, put them together into a data frame:
pages_df <- Reduce(rbind, pages_parsed)
Reduce won't mind an occasional NULL. Powodzenia! (Good luck!)
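A toy illustration of the NULL point (made-up one-column data frames):
czesci <- list(data.frame(x = 1:2), NULL, data.frame(x = 3))
Reduce(rbind, czesci)
#rbind(df, NULL) returns df unchanged, so the NULL element is skipped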
EDIT: OK, so the problem is in the parsuj_strone() function. First, replace the final line of that function with this:
filmy_df_temp <- list(Rank=rank_data,
                      Title=title_data,
                      Release.Year=year,
                      Link=link,
                      Description=description_data,
                      Runtime=runtime_data)
return(filmy_df_temp)
Run
pages_parsed <- lapply(pages, parsuj_strone)
Then, identify which of the 5 web pages returned problematic entries:
sapply(pages_parsed, function(x) sapply(x, length))
This should give you a 6 x 5 matrix (one row per field, one column per page). Finally, pick an element which has only 249 entries; how does it look? Without knowing your parser well, this should at least give you a hint where the problems may be.
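If you prefer to locate the short entries programmatically instead of by eye, a sketch along these lines should work (assuming every complete page yields 250 values per field):
dlugosci <- sapply(pages_parsed, function(x) sapply(x, length))
#rows are the six fields, columns the five pages
which(dlugosci != 250, arr.ind = TRUE)  #field/page positions that came up short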
Upvotes: 1
Reputation: 49
On Stack Overflow you will find everything; you just need to know how and where to search. Here is the link to the answer to my problem: Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?
In short: instead of html_nodes, html_node (without the s) should be used.
#read runtime
runtime_data <- html_node(szczegoly_filmu,'.text-muted .runtime')
#convert to text
runtime_data <- html_text(runtime_data)
#remove " min"
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
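For context: szczegoly_filmu is not defined in the snippet above; it stands for the nodeset of per-film blocks. It could be obtained with something like the line below (the .lister-item-content selector is an assumption based on IMDb's list markup at the time). Applied to a nodeset, html_node() returns exactly one result per element, so a film without a runtime comes back as NA and runtime_data keeps all 250 entries.
#assumed selector: one content block per film on the results page
szczegoly_filmu <- html_nodes(strona_int, '.lister-item-content')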
Upvotes: 0