Reputation: 30
My goal is to extract info from this html page to create a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing
One of the variables is the price of the apartments. I've identified that some have the div class="row_price"
code which includes the price (example A) but others don't have this code and therefore the price (example B). Hence I would like that R could read the observations without the price as NA
, otherwise it will mixed the database by giving the price from the observation that follows.
<div class="listing_column listing_row_price">
<div class="row_price">
$ 14,800
</div>
<div class="row_info">Ayer 19:53</div>
<div class="listing_column listing_row_price">
<div class="row_info">Ayer 19:50</div>
I think that if I extract the text from "listing_row_price" to the beginning of the "row_info" in a character vector I will be able to get my desired output, which is:
...
10 4000
11 14800
12 NA
13 14000
14 8000
...
But so far I've get this one and another full with NA
.
...
10 4000
11 14800
12 14000
13 8000
14 8500
...
Commands used but didn't get what I want:
html1<-read_html("file.html")
title<-html_nodes(html1,"div")
html1<-toString(title)
pattern1<-'div class="row_price">([^<]*)<'
title3<-unlist(str_extract_all(title,pattern1))
title3<-title3[c(1:35)]
pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
title3<-unlist(str_extract(title3,pattern2))
title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))
I also try with pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<
that I think it says to extract the "listing_row_price" part, then if exist extract the "row_price" part, later get the digits and finally extract the <
thats follows.
Upvotes: 0
Views: 3522
Reputation: 43354
There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. A reasonably simple strategy that works in this case, though:
library(rvest)
page <- read_html('page.html')
# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')
# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x){ifelse(length(html_children(x)) == 2,
html_text(html_children(x)[1]),
NA)})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))
Upvotes: 2