Reputation: 129
I am trying to import a table from a web page using the readHtmlTable function, this is the first few rows of what the data looks like when it is in R.
Event Athlete Country Result Medal year
1 100m Men Tom Burke USA 12.0 GOLD 1896
2 Fritz Hofmann DEU 12.2 est. SILVER 1896
3 Francis Lane USA 12.6 BRONZE 1896
4 Alajos Szokolyi HUN 12.6 est. BRONZE 1896
5 400m Men Tom Burke USA 54.2 GOLD 1896
6 Herbert Jamison USA n/a SILVER 1896
7 Charles Gmelin GBR n/a BRONZE 1896
8 800m Men Teddy Flack AUS 2:11.0 GOLD 1896
9 Nֳ¡ndor Dֳ¡ni HUN 2:11.8 est. SILVER 1896
10 Demitrios Golemis GRE n/a BRONZE 1896
Now if you look at the event column you can see some of the rows of the event field empty, this is the way the table is on the website, what I am looking for is the most efficient way to fill those blanks so at the end it should look like this
Event Athlete Country Result Medal year
1 100m Men Tom Burke USA 12.0 GOLD 1896
2 100m Men Fritz Hofmann DEU 12.2 est. SILVER 1896
3 100m Men Francis Lane USA 12.6 BRONZE 1896
4 100m Men Alajos Szokolyi HUN 12.6 est. BRONZE 1896
5 400m Men Tom Burke USA 54.2 GOLD 1896
Basically every time a field in the Event column is empty I need to fill it with the last value that was not empty. The column is saved in R as a factor and I know that technically I can do this using a for loop and going over all of the vector elements but this is very time consuming considering the fat that there are about 300000 rows in this table. I am hoping for something more efficient
Upvotes: 0
Views: 136
Reputation: 5520
The fill
function from the R Studio package tidyr
is designed to do this kind of data cleaning. We'll use it with mutate
from the dplyr
package which will first replace missing values with NA
. Your data is in data frame df
Code looks like
library(dplyr)
library(tidyr)
df_fill <- df %>% mutate(Event = ifelse(Event == "", NA,Event)) %>%
fill(Event)
Upvotes: 1
Reputation: 38500
Here is a base R method using rle
and inverse.rle
:
# create run length encoding
temp <- rle(df$event)
# fill in missing values with previous values
temp$values[is.na(temp$values)] <- temp$values[which(is.na(temp$values))-1]
# fill in vector of data.frame
df$event <- inverse.rle(temp)
Note that this won't work if there is a missing value in the first position of the variable.
Upvotes: 0
Reputation: 1222
Here's a toy example of how the purrr package could be used to solve your problem, assuming the data is in a data.frame
and the missing values are NA
:
library(purrr)
df <- data.frame("event" = c(1, NA, 2, NA, 3, NA, 5), "other" = 1:7)
df
# event other
# 1 1 1
# 2 NA 2
# 3 2 3
# 4 NA 4
# 5 3 5
# 6 NA 6
# 7 5 7
df$event <- accumulate(.x = df$event, .f = function(x, y) { if(is.na(y)) x else y })
df
# event other
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
# 5 3 5
# 6 3 6
# 7 5 7
Upvotes: 1