Reputation: 17
I'm using the following code to scrape info on houses (e.g. address and sales period):
scrape_func <- function(url) {
  library(rvest)
  library(stringr)
  Sys.sleep(0.25)
  webpage <- read_html(url)
  boligsiden <- webpage %>%
    html_nodes("script") %>%
    html_text()
  webpage <- read_html(url)
  bolig_adresse <- webpage %>%
    html_nodes("div.t-truncate.t-bold") %>%
    html_text()
  liggetid_final <- as.matrix(as.numeric(unlist(str_extract_all(boligsiden, '(?<="salesPeriod":")[^"]+'))))
  udbudt_final <- as.matrix(unlist(str_extract_all(boligsiden, '(?<="dateAdded":")[^"]+')), tryFormats=c("%d-%m-%Y"))
  solgt_final <- as.matrix(unlist(str_extract_all(boligsiden, '(?<="dateRemoved":")[^"]+')), tryFormats=c("%d-%m-%Y"))
  scrape <- unique(cbind(bolig_adresse[1], liggetid_final, udbudt_final, solgt_final))
  return(scrape)
}
links <- sapply(df_all$link, as.character)
scrape <- data.frame(rbind(sapply(links, scrape_func)))
When I run the last line (scrape <- data.frame(rbind(sapply(links, scrape_func)))), I get the following warning message:
In cbind(bolig_adresse[1], liggetid_final, udbudt_final, solgt_final) :
number of rows of result is not a multiple of vector length (arg 1),
and the data frame consists of just 1 observation of 60 variables. There are 60 different webpages, each with multiple lines (most of the houses have been sold several times), so the final data frame should contain more than 60 rows and exactly 4 columns.
When I just use scrape <- unique(cbind(bolig_adresse[1], liggetid_final, udbudt_final, solgt_final)) on one website (for one house), it works just fine.
Scaling it up, I just want R to stack the matrices on top of each other into one big data frame, but I can't figure out how to do that.
Additional info:
An example of a URL: https://www.boligsiden.dk/adresse/ledoejevej-21-2620-albertslund-01650298__21_______ For this particular municipality there are around 3,000 different houses, and I need to combine the data from each of them into one data set.
One url might return: address1
And another may return: address2
However, it seems that do.call does the trick, but I still get the warning messages.
Upvotes: 1
Views: 724
Reputation: 2115
One of your URLs is returning zero rows, raising the warning. Your code is probably also flattening out the result.
scrape_func <- function(m) {
  unique(cbind(m[1], m))
}
links <- list(as.matrix(head(women, n = 2)), as.matrix(head(women, n = 0)))
scrape <- data.frame(rbind(sapply(links, scrape_func)))
# Warning message:
# In cbind(m[1], m) :
#   number of rows of result is not a multiple of vector length (arg 1)
scrape
#                         X1 X2
# 1 58, 58, 58, 59, 115, 117
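To find out which inputs trigger the warning, one option is to wrap each call in tryCatch and record where a warning occurs. This is only a sketch on the same toy data: f stands in for the scraper, and raises_warning is a hypothetical helper name, not part of any package.

```r
# Toy stand-in for the scraper, same shape as the example above.
f <- function(m) unique(cbind(m[1], m))

# Two inputs: one with rows, one with zero rows (built-in `women` data set).
inputs <- list(as.matrix(head(women, n = 2)), as.matrix(head(women, n = 0)))

# Hypothetical helper: TRUE if calling f(x) raises a warning, FALSE otherwise.
raises_warning <- function(x) {
  tryCatch({ f(x); FALSE }, warning = function(w) TRUE)
}

flagged <- vapply(inputs, raises_warning, logical(1))
which(flagged)  # positions of the inputs that triggered a warning
```

With real URLs the same pattern should apply, with links in place of inputs and the scraper in place of f, at the cost of fetching each page once more.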
Instead, you want to enforce the structure of the output of scrape_func regardless of what each page returns.
scrape_func <- function(m) {
  scrape <- matrix(NA, ncol = 2, nrow = max(length(m[1]), length(m)),
                   dimnames = list(NULL, c("bolig_adresse", "liggetid_final")))
  scrape[seq_along(m[1]), 1] <- m[1]
  scrape[seq_along(c(m)), 2] <- c(m)
  as.data.frame(unique(scrape))
}
links <- list(as.matrix(head(women, n = 2)), as.matrix(head(women, n = 0)))
scrape <- dplyr::bind_rows(lapply(links, scrape_func))
scrape
#   bolig_adresse liggetid_final
# 1            58             58
# 2            NA             59
# 3            NA            115
# 4            NA            117
# 5            NA             NA
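The same idea could be carried over to the four columns in the question. The sketch below uses base R only and made-up values instead of real pages; build_rows, pad, and the contents of pages are hypothetical, and the scraping step itself is left out.

```r
# Hypothetical helper: always return the same four columns,
# padding short or empty pieces with NA.
build_rows <- function(adresse, liggetid, udbudt, solgt) {
  n <- max(length(liggetid), length(udbudt), length(solgt), 1L)
  pad <- function(x) c(x, rep(NA, n - length(x)))
  unique(data.frame(
    bolig_adresse  = if (length(adresse)) adresse[1] else NA_character_,
    liggetid_final = as.numeric(pad(liggetid)),
    udbudt_final   = as.character(pad(udbudt)),
    solgt_final    = as.character(pad(solgt)),
    stringsAsFactors = FALSE
  ))
}

# Made-up stand-ins for two scraped pages; the second one is empty.
pages <- list(
  list(adresse = "Vej 1", liggetid = c(30, 45),
       udbudt = c("01-02-2015", "03-04-2018"),
       solgt  = c("03-03-2015", "18-05-2018")),
  list(adresse = character(0), liggetid = numeric(0),
       udbudt = character(0), solgt = character(0))
)

# lapply() keeps one data frame per page; rbind stacks them row-wise.
scrape <- do.call(rbind, lapply(pages, function(p) {
  build_rows(p$adresse, p$liggetid, p$udbudt, p$solgt)
}))
dim(scrape)
```

Because every page yields a data frame with identical columns, stacking no longer depends on how many sales a page contains; an empty page shows up as a single all-NA row that can be filtered out afterwards.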
Also, prefer to avoid sapply: letting it decide when to simplify the output almost always leads to bugs at some point. :)
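A small illustration of why sapply is risky here, using made-up inputs rather than scraped pages: the shape of its result depends on whether every element happens to have the same length, whereas lapply always returns a list that can then be combined explicitly.

```r
# Toy inputs standing in for per-page results (hypothetical data).
same_len <- list(1:2, 3:4)
diff_len <- list(1:2, integer(0))

# sapply() simplifies to a matrix only when all results have equal length...
simplified <- sapply(same_len, identity)
is.matrix(simplified)

# ...and silently falls back to a list when they do not.
not_simplified <- sapply(diff_len, identity)
is.list(not_simplified)

# lapply() always returns a list, so the combining step is explicit.
res <- lapply(diff_len, function(x) data.frame(value = x))
stacked <- do.call(rbind, res)
nrow(stacked)
```

When a fixed return type is wanted, vapply is another option, since it fails with an error if a result does not match the declared template.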
Upvotes: 2