Reputation: 21
I am trying to scrape the Property24 website. However, it returns extra data rows that are not on the page. Here is my code:
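library(rvest)
# Read the listing page (the URL must be one unbroken string)
property <- read_html("https://www.property24.com/houses-for-sale/cape-town/western-cape/432")
# Extract the text of every node matching each class
price <- property %>% html_nodes(".p24_price") %>% html_text()
desc <- property %>% html_nodes(".p24_excerpt") %>% html_text()
title <- property %>% html_nodes(".p24_title") %>% html_text()
# Keep only the digits of each price
price <- gsub("[^0-9]", "", price)
# Remove runs of spaces/tabs and line breaks, then truncate to 100 characters
desc <- gsub("[ \t]{2,}", "", desc)
desc <- gsub("\r\n", "", desc)
desc <- strtrim(desc, 100)
property_table <- data.frame(price, title, desc)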
Upvotes: 2
Views: 247
Reputation: 124646
The problem is that the price, title, and desc vectors have different lengths.
Why is that? Look at their contents.
You will find that some values don't look like a proper price or description.
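A quick way to confirm this is to compare the vector lengths before building the table. The sketch below uses made-up stand-in vectors, not scraped data; it also shows that data.frame() silently recycles a shorter vector when the longer length is a whole multiple of it, which is one way rows that are not on the page can appear.

```r
# Stand-in vectors (made up for illustration, not scraped data).
price <- c("1000000", "2000000", "3000000", "4000000")
title <- c("House A", "House B")

# Compare the lengths first; a mismatch means the selectors matched
# different numbers of nodes.
vector_lengths <- lengths(list(price = price, title = title))
print(vector_lengths)

# data.frame() recycles the shorter vector because 4 is a multiple of 2,
# so the titles repeat and the table has 4 rows instead of failing.
property_table <- data.frame(price, title, stringsAsFactors = FALSE)
print(property_table)
```

If the lengths are not multiples of each other, data.frame() stops with an error instead, so the mismatch is at least visible.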
This happens because the selectors .p24_price and .p24_excerpt are not specific enough.
You need to look at the page source and make the selectors more specific.
For example, this will be better:
price <- property %>% html_nodes(".p24_content .p24_price") %>% html_text()
desc <- property %>% html_nodes(".p24_content .p24_excerpt") %>% html_text()
title <- property %>% html_nodes(".p24_content .p24_title") %>% html_text()
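Another way to keep the three columns aligned is to select each listing container first and then pull one value per listing with html_node() (singular), which yields NA when a field is missing. This is only a sketch: the HTML below is made up, and treating .p24_content as the per-listing container is an assumption that should be checked against the real page source.

```r
library(rvest)

# Made-up HTML standing in for the real page; the second listing
# deliberately has no excerpt.
page <- read_html('
  <div class="p24_content">
    <span class="p24_price">R 1 000 000</span>
    <span class="p24_title">House A</span>
    <span class="p24_excerpt">Sunny family home</span>
  </div>
  <div class="p24_content">
    <span class="p24_price">R 2 000 000</span>
    <span class="p24_title">House B</span>
  </div>')

# One node per listing (assumed container class).
listings <- page %>% html_nodes(".p24_content")

# html_node() returns exactly one result per listing, so every column
# has the same length; a missing field becomes NA.
property_table <- data.frame(
  price = listings %>% html_node(".p24_price") %>% html_text(),
  title = listings %>% html_node(".p24_title") %>% html_text(),
  desc  = listings %>% html_node(".p24_excerpt") %>% html_text(),
  stringsAsFactors = FALSE
)
print(property_table)
```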
But I see at least one more problem. Some properties have more than one price, for example:
From R 12 250 000 to R 13 995 999
So the way you extract the price part using gsub needs improvement too.
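With the original gsub("[^0-9]", "", price), a range like the one above collapses into the single string "1225000013995999". One possible fix, sketched below, is to pull out each price separately. The helper name extract_prices is hypothetical, and the pattern assumes every price is written as "R" followed by space-separated digit groups, as in the example.

```r
# Hypothetical helper: return every price found in one listing's price text.
extract_prices <- function(text) {
  # Match "R", optional whitespace, then digits and spaces.
  pattern <- "R[[:space:]]*[0-9][0-9[:space:]]*"
  matches <- regmatches(text, gregexpr(pattern, text))[[1]]
  # Drop everything except digits and convert to numbers.
  as.numeric(gsub("[^0-9]", "", matches))
}

range_prices <- extract_prices("From R 12 250 000 to R 13 995 999")
single_price <- extract_prices("R 2 495 000")
print(range_prices)
print(single_price)
```

For a range this returns two values, so you would still need to decide whether to store the minimum, the maximum, or both in the table.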
Upvotes: 1