Felix Zhao

Reputation: 489

rvest scraping data with different length

As a practice project, I am trying to scrape property data from a website (purely to practice web scraping; I don't intend to use the scraped data for anything else). However, some properties have no price listed, so the name and price vectors end up with different lengths, and combining them into one data frame raises an error.

Here is the code for scraping:

library(tidyverse)
library(rvest)

web_page <- read_html("https://wx.fang.anjuke.com/loupan/all/a1_p2/")

community_name <- web_page %>% 
  html_nodes(".items-name") %>% 
  html_text()

length(community_name)

listed_price <- web_page %>% 
  html_nodes(".price") %>% 
  html_text()

length(listed_price)

property_data <- data.frame(
  name=community_name,
  price=listed_price
)

How can I identify the properties with no listed price and fill the price variable with NA when no value is scraped?

Upvotes: 0

Views: 220

Answers (1)

neilfws

Reputation: 33782

Inspection of the web page shows that the class is .price when price has a value, and .price-txt when it does not. So one solution is to use an XPath expression in html_nodes() and match classes that start with "price":

listed_price <- web_page %>% 
  html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>% 
  html_text()

length(listed_price)
[1] 60
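
Note that the XPath match above returns whatever placeholder text sits in the .price-txt element rather than NA. If a literal NA is wanted, one possible alternative is to select each listing's container first and then call html_node() (singular) inside it, which yields one result per container and NA where the element is missing. The sketch below is only an illustration under assumptions: the inline HTML stands in for the real page, and the ".item" container class is hypothetical, so the actual container selector would need to be read off the page.

```r
library(rvest)

# Inline stand-in for the listing page. The ".item" container class is
# hypothetical; inspect the real page to find the per-listing container.
demo_page <- read_html('
  <div class="item"><span class="items-name">A</span><p class="price">12000</p></div>
  <div class="item"><span class="items-name">B</span></div>
  <div class="item"><span class="items-name">C</span><p class="price-txt">Price pending</p></div>
')

# One node per listing, so every column below has the same length.
items <- html_nodes(demo_page, ".item")

# html_node() returns a missing node when nothing matches inside a listing,
# and html_text() turns that missing node into NA.
property_data <- data.frame(
  name  = items %>% html_node(".items-name") %>% html_text(),
  price = items %>% html_node(".price") %>% html_text(),
  stringsAsFactors = FALSE
)

print(property_data)
```

Because both columns are derived from the same set of containers, data.frame() no longer sees vectors of different lengths, and is.na(property_data$price) identifies the listings without a price.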

Upvotes: 1
