jacobsab

Reputation: 21

Web scraping in R

I am trying to scrape the Property24 website. However, it returns extra data rows which are not on the page. Here is my code.

library(rvest)

property <- read_html("https://www.property24.com/houses-for-sale/cape-town/western-cape/432")
price <- property %>% html_nodes(".p24_price") %>% html_text()
desc  <- property %>% html_nodes(".p24_excerpt") %>% html_text()
title <- property %>% html_nodes(".p24_title") %>% html_text()



price <- gsub("[^0-9]", "", price)
desc  <- gsub("[ \t]{2,}", "", desc)
desc  <- gsub("\r\n", "", desc)
desc  <- strtrim(desc, 100)

property_table <- data.frame(price, title, desc)

Upvotes: 2

Views: 247

Answers (1)

janos

Reputation: 124646

The problem is that the price, title and desc vectors have different lengths.

Why is that? Look at their contents.
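
For example, a quick check in the console (a diagnostic sketch using the vectors from the question) makes the mismatch visible:

length(price)
length(title)
length(desc)

head(price)   # some of these raw values will not look like prices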

You will find that some values don't look like a proper price or description, because the selectors .p24_price and .p24_excerpt are not specific enough. You need to look at the page source and make them more specific. For example, this will be better:

price <- property %>% html_nodes(".p24_content .p24_price") %>% html_text()
desc  <- property %>% html_nodes(".p24_content .p24_excerpt") %>% html_text()
title <- property %>% html_nodes(".p24_content .p24_title") %>% html_text() 

But I see at least one more problem. Some properties have more than one price, for example:

From R 12 250 000 to R 13 995 999

So the way you extract the price part using gsub needs improvement too.
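
One possible approach (a sketch, not the only way to handle it) is to keep just the first amount before stripping the formatting, so a range collapses to its lower bound; listings without a numeric price end up as NA:

# Keep only the first amount, e.g. "From R 12 250 000 to R 13 995 999" -> "12 250 000"
first_price <- sub("^[^0-9]*([0-9][0-9 ]*).*$", "\\1", price)
# Strip spaces and convert; entries with no digits (e.g. "POA") become NA
price_num <- as.numeric(gsub("[^0-9]", "", first_price))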

Upvotes: 1
