KatSA

Reputation: 17

Combining multiple matrices into one data frame

I'm using the following code to scrape info on houses (e.g. address and sales period):

scrape_func <- function(url) {
  library(rvest)
  library(stringr)
  Sys.sleep(0.25)
  
  webpage <- read_html(url)
  boligsiden <- webpage %>%
    html_nodes("script") %>%
    html_text()
  
  webpage <- read_html(url)
  bolig_adresse <- webpage %>%
    html_nodes("div.t-truncate.t-bold") %>%
    html_text()
  
  liggetid_final <- as.matrix(as.numeric(unlist(str_extract_all(boligsiden, '(?<="salesPeriod":")[^"]+'))))
  
  udbudt_final <- as.matrix(unlist(str_extract_all(boligsiden, '(?<="dateAdded":")[^"]+')), tryFormats=c("%d-%m-%Y"))
  
  solgt_final <- as.matrix(unlist(str_extract_all(boligsiden, '(?<="dateRemoved":")[^"]+')), tryFormats=c("%d-%m-%Y"))
  
  scrape <- unique(cbind(bolig_adresse[1], liggetid_final, udbudt_final, solgt_final))
  
  return(scrape)
  
}

links <- sapply(df_all$link, as.character)

scrape <- data.frame(rbind(sapply(links, scrape_func)))

When I run the last line (scrape <- data.frame(rbind(sapply(links, scrape_func)))), I get the following warning message:

In cbind(bolig_adresse[1], liggetid_final, udbudt_final, solgt_final) :
number of rows of result is not a multiple of vector length (arg 1),

and the data frame just consists of 1 obs and 60 variables. There are 60 different webpages, but multiple lines for each webpage (since there are multiple sales for most of the houses), so the end data frame should contain more than 60 rows and exactly 4 columns.

When I just use scrape <- unique(cbind(bolig_adresse[1], liggetid_final, udbudt_final, solgt_final)) on one web page (for one house), it works just fine.

Scaling it up, I just want R to stack the matrices on top of each other into one big data frame, but I can't figure out how to do that.

Additional info:

An example URL: https://www.boligsiden.dk/adresse/ledoejevej-21-2620-albertslund-01650298__21_______ For this particular municipality there are around 3000 different houses, and I need to combine the data from each of them into one data set.

One url might return: address1

And another may return: address2

It seems that do.call does the trick for stacking, but I still get the warning messages.

Upvotes: 1

Views: 724

Answers (1)

CSJCampbell

Reputation: 2115

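One of your URLs is returning zero rows, which raises the warning: cbind cannot recycle the single address against a zero-row matrix. Your code is probably also flattening out the result, because sapply returns a list when the matrices differ in size and rbind then binds that list into a single row.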

scrape_func <- function(m) {
    unique(cbind(m[1], m))
}
links <- list(as.matrix(head(women, n = 2)), as.matrix(head(women, n = 0)))
scrape <- data.frame(rbind(sapply(links, scrape_func)))
# Warning message:
#     In cbind(m[1], m) :
#     number of rows of result is not a multiple of vector length (arg 1)
scrape
#                         X1 X2
# 1 58, 58, 58, 59, 115, 117   

Instead, you want to enforce the structure of the output of scrape_func regardless of what the page returns.

scrape_func <- function(m) {
    scrape <- matrix(NA, ncol = 2, nrow = max(length(m[1]), length(m)), 
        dimnames = list(NULL, c("bolig_adresse", "liggetid_final")))
    scrape[seq_along(m[1]), 1] <- m[1]
    scrape[seq_along(c(m)), 2] <- c(m)
    as.data.frame(unique(scrape))
}
links <- list(as.matrix(head(women, n = 2)), as.matrix(head(women, n = 0)))
scrape <- dplyr::bind_rows(lapply(links, scrape_func))
scrape
#   bolig_adresse liggetid_final
# 1            58             58
# 2            NA             59
# 3            NA            115
# 4            NA            117
# 5            NA             NA
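
If you would rather not depend on dplyr, the same stacking can be sketched in base R with do.call(rbind, lapply(...)), which is presumably the do.call approach you mention. This is only a minimal sketch: parse_one and the pages list are hypothetical stand-ins for your scraper and its extracted values, the addresses and dates are made-up placeholders, and it assumes the three extracted vectors for a page always have the same length.

```r
# Hypothetical stand-in for scrape_func: takes already-extracted vectors and
# always returns a data frame with the same four columns, even for zero rows.
parse_one <- function(page) {
  n <- length(page$liggetid)
  data.frame(
    bolig_adresse  = rep(page$adresse, length.out = n),  # recycle address per sale
    liggetid_final = as.numeric(page$liggetid),
    udbudt_final   = as.character(page$udbudt),
    solgt_final    = as.character(page$solgt),
    stringsAsFactors = FALSE
  )
}

# Made-up placeholder data: one page with two sales, one page with none.
pages <- list(
  list(adresse = "address1", liggetid = c("30", "112"),
       udbudt = c("01-02-2015", "10-06-2019"),
       solgt  = c("03-03-2015", "30-09-2019")),
  list(adresse = "address2", liggetid = character(0),
       udbudt = character(0), solgt = character(0))
)

# lapply always returns a list; do.call(rbind, ...) stacks its elements.
scrape <- do.call(rbind, lapply(pages, parse_one))
```

Because every call returns the same columns, a page with no sales simply contributes zero rows instead of triggering the recycling warning.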

Also, prefer to avoid sapply: letting the function decide when to simplify the output almost always leads to bugs at some point. :)
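
To illustrate why that simplification is risky, here is a small base R sketch (the inputs and the helper double_it are made up for the demonstration). The same sapply call returns a matrix for one input and a list for another, whereas lapply always returns a list and vapply fails loudly when a result does not have the declared shape.

```r
same_len  <- list(1:2, 3:4)          # every element has length 2
diff_len  <- list(1:2, integer(0))   # second element is empty
double_it <- function(x) x * 2       # hypothetical helper for the demo

res_same <- sapply(same_len, double_it)  # results share a length: simplified to a matrix
res_diff <- sapply(diff_len, double_it)  # lengths differ: sapply falls back to a list

res_lapply <- lapply(diff_len, double_it)  # lapply never simplifies

# vapply checks each result against the template numeric(2) and errors
# on the empty element instead of silently changing the return type.
vapply_failed <- tryCatch({
  vapply(diff_len, double_it, numeric(2))
  FALSE
}, error = function(e) TRUE)

is.matrix(res_same)
is.list(res_diff)
vapply_failed
```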

Upvotes: 2
