dmitre88
dmitre88

Reputation: 30

How to extract text from a several "div class" (html) using R?

My goal is to extract info from this html page to create a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing

One of the variables is the price of the apartments. I've identified that some have the div class="row_price" code which includes the price (example A) but others don't have this code and therefore the price (example B). Hence I would like that R could read the observations without the price as NA, otherwise it will mixed the database by giving the price from the observation that follows.

Example A

<div class="listing_column listing_row_price">
    <div class="row_price">
      $ 14,800
    </div>
<div class="row_info">Ayer&nbsp;19:53</div>

Example B

<div class="listing_column listing_row_price">

<div class="row_info">Ayer&nbsp;19:50</div>

I think that if I extract the text from "listing_row_price" to the beginning of the "row_info" in a character vector I will be able to get my desired output, which is:

...
10 4000
11 14800
12 NA
13 14000
14 8000
...

But so far I've get this one and another full with NA.

...
10 4000
11 14800
12 14000
13 8000
14 8500
...

Commands used but didn't get what I want:

    html1<-read_html("file.html")
    title<-html_nodes(html1,"div")
    html1<-toString(title)
    pattern1<-'div class="row_price">([^<]*)<'
    title3<-unlist(str_extract_all(title,pattern1))
    title3<-title3[c(1:35)]
    pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
    title3<-unlist(str_extract(title3,pattern2))
    title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
    title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))

I also try with pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)< that I think it says to extract the "listing_row_price" part, then if exist extract the "row_price" part, later get the digits and finally extract the < thats follows.

Upvotes: 0

Views: 3522

Answers (1)

alistaire
alistaire

Reputation: 43354

There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. A reasonably simple strategy that works in this case, though:

library(rvest)

page <- read_html('page.html')

# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')

# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x){ifelse(length(html_children(x)) == 2, 
                                              html_text(html_children(x)[1]), 
                                              NA)})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))

Upvotes: 2

Related Questions