Reputation: 2017
I was wondering if it's possible to make a function report its progress while it runs.
For example, I'm using the function below to scrape some data and add each entry as a new row in an existing data frame.
new_function <- function() {
  for (i in 1:nrow(temp_data)) {
    temp_data_point <- temp_data[i, ]
    file <- read_html(temp_data_point)
    tables <- html_nodes(file, "table")
    table1 <- html_table(tables[8], fill = TRUE)
    table2 <- as.data.frame(table1)
    table2 <- table2[15:24, 1:2]
    colnames(table2)[1] <- "variables"
    colnames(table2)[2] <- "results"
    table2[1, 1] <- "name"
    table2[2, 1] <- "legal_form"
    table2[3, 1] <- "industry"
    table2[4, 1] <- "tax_num"
    table2[5, 1] <- "id"
    table2[6, 1] <- "account_num"
    table2[7, 1] <- "bank_name"
    table2[8, 1] <- "address"
    table2[9, 1] <- "location"
    table2[10, 1] <- "phone"
    test2 <- spread(table2, variables, results)
    temp_table3[i, ] <- test2
  }
  return(temp_table3)
}
new_df <- new_function()
However, as I am sending thousands of requests, the function runs for more than an hour.
Rather than only measuring the elapsed time with Sys.time() at the end, I would like a status message, perhaps every minute or so, that prints the number of rows in the data frame so far.
Is this possible?
Upvotes: 2
Views: 431
Reputation: 74
The preferred method might be a progress bar. The following is a link to a decent tutorial.
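As a rough illustration of the progress-bar idea, here is a minimal sketch using txtProgressBar() from base R's utils package. The process_row() helper and the value of n are hypothetical stand-ins for the real scraping step and nrow(temp_data); they are not part of the original code.

```r
# Hypothetical stand-in for one scraping step; replace with the real work.
process_row <- function(i) {
  Sys.sleep(0.01)  # simulate a slow request
  i
}

n <- 20  # assumption: stands in for nrow(temp_data)
results <- vector("list", n)

# style = 3 draws a bar together with a percentage
pb <- txtProgressBar(min = 0, max = n, style = 3)
for (i in seq_len(n)) {
  results[[i]] <- process_row(i)
  setTxtProgressBar(pb, i)  # move the bar to the current iteration
}
close(pb)
```

The bar is redrawn on every iteration here; for very fast loops it could be updated less often, for example only when i %% 10 == 0.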
Another option might be to display a message every 10 (or another arbitrary number of) iterations by inserting something like:
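if (i %% 10 == 0) {
  # Use the name as a literal string: inside a function, substitute() on a
  # local variable returns its value rather than its name
  message("temp_table3 has ", nrow(temp_table3), " rows.")
}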
The best option may be to insert the number of rows in a message to display in the progress bar label. Optionally, the progress bar can be updated every 10 (or an arbitrary number of) iterations.
The condition to trigger a message could also be based on a time difference which can be evaluated at every iteration.
Upvotes: 0
Reputation: 526
You can keep track of the time you last printed a message (or the time you started the loop) and have the function print the current index every 60 seconds.
Note that this adds an extra 15.8 microseconds to each loop iteration.
Your code becomes
library(rvest)  # read_html(), html_nodes(), html_table()
library(tidyr)  # spread()

new_function <- function() {
  # Record the time of the last progress message (initially the start time)
  time_print <- as.numeric(Sys.time())
  for (i in seq_len(nrow(temp_data))) {
    # Print the current row index once more than 60 seconds have passed
    time_now <- as.numeric(Sys.time())
    if (time_now - time_print > 60) {
      message('Working on row ', i)
      time_print <- time_now
    }
    temp_data_point <- temp_data[i, ]
    file <- read_html(temp_data_point)
    tables <- html_nodes(file, "table")
    table1 <- html_table(tables[8], fill = TRUE)
    table2 <- as.data.frame(table1)
    table2 <- table2[15:24, 1:2]
    colnames(table2) <- c("variables", "results")
    # Label the ten extracted rows (same values as the original assignments)
    table2[1:10, 1] <- c("name", "legal_form", "industry", "tax_num", "id",
                         "account_num", "bank_name", "address", "location",
                         "phone")
    test2 <- spread(table2, variables, results)
    temp_table3[i, ] <- test2
  }
  return(temp_table3)
}
new_df <- new_function()
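If you want to check the per-iteration overhead of the time check on your own machine, something like the following sketch gives a rough estimate. The repetition count n is an arbitrary assumption, and the result will vary between systems, so it need not match the figure quoted above.

```r
n <- 100000  # assumption: arbitrary number of repetitions

# Time n calls of the same expression used inside the loop
elapsed <- system.time(
  for (k in seq_len(n)) {
    time_now <- as.numeric(Sys.time())
  }
)[["elapsed"]]

per_call_us <- elapsed / n * 1e6  # seconds per call converted to microseconds
message("Approximate cost per call: ", round(per_call_us, 2), " microseconds")
```

Either way, the cost should be small next to a single HTTP request, so the check can safely run on every iteration.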
Upvotes: 2