Prometheus

Reputation: 2017

Verbose return of a function in R

I was wondering if it's possible to make the return of a function verbose.

For example, I'm using the function below to scrape some data and add each entry as a new row in an existing data frame.

new_function <- function() {

  for (i in 1:nrow(temp_data)) {

    temp_data_point <- temp_data[i, ]
    file <- read_html(temp_data_point)
    tables <- html_nodes(file, "table")
    table1 <- html_table(tables[8], fill = TRUE)
    table2 <- as.data.frame(table1)
    table2 <- table2[15:24 , 1:2]


    colnames(table2)[1] <- "variables"
    colnames(table2)[2] <- "results"


    table2[1, 1] <- "name"
    table2[2, 1] <- "legal_form"
    table2[3, 1] <- "industry"
    table2[4, 1] <- "tax_num"
    table2[5, 1] <- "id"
    table2[6, 1] <- "account_num"
    table2[7, 1] <- "bank_name"
    table2[8, 1] <- "address"
    table2[9, 1] <- "location"
    table2[10, 1] <- "phone"

    test2 <- spread(table2, variables, results)
    temp_table3[i, ] <- test2

  }

  return(temp_table3)

}

new_df <- new_function()

However, as I am sending thousands of requests, the function runs for more than an hour.

Rather than only measuring the elapsed time with Sys.time() at the end, I would like the function to report back while it runs, perhaps every minute or so, by printing the number of rows in the data frame.

Is this possible?

Upvotes: 2

Views: 431

Answers (2)

jobou

Reputation: 74

The preferred method might be a progress bar. The following is a link to a decent tutorial.
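As a rough illustration of the progress-bar approach, here is a minimal sketch using txtProgressBar() from base R's utils package. The process_one() helper and the urls vector are hypothetical stand-ins for the scraping step and its input; they are not taken from the question.

```r
# Hypothetical stand-in for the scraping work done on each row.
process_one <- function(x) {
  Sys.sleep(0.01)  # simulate a slow request
  nchar(x)
}

urls <- sprintf("page-%03d", 1:50)  # placeholder input, not real URLs
results <- integer(length(urls))

# Text progress bar that runs from 0 to the number of items.
pb <- txtProgressBar(min = 0, max = length(urls), style = 3)
for (i in seq_along(urls)) {
  results[i] <- process_one(urls[i])
  setTxtProgressBar(pb, i)  # redraw the bar at position i
}
close(pb)
```

In the real function, the body of process_one() would be replaced by the read_html() and table-handling steps, and the bar would run to nrow(temp_data).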

Another option might be to display a message every 10 (or another arbitrary number of) iterations by inserting something like:

if ((i %% 10) == 0) {
    message(paste(substitute(temp_table3), "has", nrow(temp_table3), "rows."))
}

The best option may be to insert the number of rows in a message to display in the progress bar label. Optionally, the progress bar can be updated every 10 (or an arbitrary number of) iterations.
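Base R's txtProgressBar() documents its label argument as ignored, so one way to approximate a labelled bar in the console is to redraw a single status line with cat() and a carriage return. The sketch below does that every 10 iterations. The status_line() helper and the dummy temp_table3 built here are hypothetical and only serve the illustration.

```r
# Hypothetical helper: build a one-line status string.
status_line <- function(done, total, n_rows) {
  sprintf("%3.0f%% done (%d/%d), data frame has %d rows",
          100 * done / total, done, total, n_rows)
}

n <- 100L
# Dummy stand-in for the data frame that the real loop fills.
temp_table3 <- data.frame(id = integer(0), value = character(0),
                          stringsAsFactors = FALSE)

for (i in seq_len(n)) {
  # Placeholder for the real scraping step: append one row.
  temp_table3[i, ] <- list(i, paste0("row-", i))

  # Redraw the status line every 10 iterations.
  if (i %% 10 == 0) {
    cat("\r", status_line(i, n, nrow(temp_table3)), sep = "")
    flush.console()
  }
}
cat("\n")
```

Because the line is overwritten in place, the console shows one status line that keeps updating instead of a growing list of messages.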

The condition to trigger a message could also be based on a time difference which can be evaluated at every iteration.

Upvotes: 0

Ben Fasoli

Reputation: 526

You can keep track of the time you last printed a message (or the time the loop started) and have the function print the current row index every 60 seconds.

Note that this adds about 15.8 microseconds to each loop iteration, which is negligible next to the time taken by each request.

Your code becomes

new_function <- function() {

  # Time of the last printed message (initialized to the start time)
  time_print <- as.numeric(Sys.time())

  for (i in 1:nrow(temp_data)) {

    # Print the current row index about once a minute
    time_now <- as.numeric(Sys.time())
    if (time_now - time_print > 60) {
      message('Working on row ', i)
      time_print <- time_now
    }

    temp_data_point <- temp_data[i, ]
    file <- read_html(temp_data_point)
    tables <- html_nodes(file, "table")
    table1 <- html_table(tables[8], fill = TRUE)
    table2 <- as.data.frame(table1)
    table2 <- table2[15:24 , 1:2]


    colnames(table2)[1] <- "variables"
    colnames(table2)[2] <- "results"


    table2[1, 1] <- "name"
    table2[2, 1] <- "legal_form"
    table2[3, 1] <- "industry"
    table2[4, 1] <- "tax_num"
    table2[5, 1] <- "id"
    table2[6, 1] <- "account_num"
    table2[7, 1] <- "bank_name"
    table2[8, 1] <- "address"
    table2[9, 1] <- "location"
    table2[10, 1] <- "phone"

    test2 <- spread(table2, variables, results)
    temp_table3[i, ] <- test2

  }

  return(temp_table3)

}

new_df <- new_function()
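The per-iteration cost of the extra Sys.time() call can be estimated with a rough timing loop like the one below. This is only a sketch: the figure depends on the machine and R version, the measurement includes the overhead of the for loop itself, and the variable names n and per_call_us are chosen here for illustration.

```r
# Time n calls to Sys.time() and report the average cost per call.
n <- 100000L
elapsed <- system.time(
  for (i in seq_len(n)) {
    time_now <- as.numeric(Sys.time())
  }
)[["elapsed"]]

per_call_us <- elapsed / n * 1e6  # microseconds per iteration
message("Approximate overhead: ", signif(per_call_us, 3),
        " microseconds per call")
```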

Upvotes: 2
