Lee
Lee

Reputation: 129

Replace missing values in a vector with the last value that was not missing in R

I am trying to import a table from a web page using the readHtmlTable function, this is the first few rows of what the data looks like when it is in R.

              Event                Athlete Country      Result  Medal year
1          100m Men              Tom Burke     USA        12.0   GOLD 1896
2                            Fritz Hofmann     DEU   12.2 est. SILVER 1896
3                             Francis Lane     USA        12.6 BRONZE 1896
4                          Alajos Szokolyi     HUN   12.6 est. BRONZE 1896
5          400m Men              Tom Burke     USA        54.2   GOLD 1896
6                          Herbert Jamison     USA         n/a SILVER 1896
7                           Charles Gmelin     GBR         n/a BRONZE 1896
8          800m Men            Teddy Flack     AUS      2:11.0   GOLD 1896
9                            Nֳ¡ndor Dֳ¡ni     HUN 2:11.8 est. SILVER 1896
10                       Demitrios Golemis     GRE         n/a BRONZE 1896

Now if you look at the event column you can see some of the rows of the event field empty, this is the way the table is on the website, what I am looking for is the most efficient way to fill those blanks so at the end it should look like this

              Event       Athlete      Country      Result  Medal year
1          100m Men       Tom Burke       USA        12.0   GOLD 1896
2          100m Men       Fritz Hofmann   DEU   12.2 est. SILVER 1896
3          100m Men       Francis Lane    USA        12.6 BRONZE 1896
4          100m Men       Alajos Szokolyi HUN   12.6 est. BRONZE 1896
5          400m Men       Tom Burke       USA        54.2   GOLD 1896

Basically every time a field in the Event column is empty I need to fill it with the last value that was not empty. The column is saved in R as a factor and I know that technically I can do this using a for loop and going over all of the vector elements but this is very time consuming considering the fat that there are about 300000 rows in this table. I am hoping for something more efficient

Upvotes: 0

Views: 136

Answers (3)

WaltS
WaltS

Reputation: 5520

The fill function from the R Studio package tidyr is designed to do this kind of data cleaning. We'll use it with mutate from the dplyr package which will first replace missing values with NA. Your data is in data frame df Code looks like

library(dplyr)
library(tidyr)
df_fill <- df %>% mutate(Event = ifelse(Event == "", NA,Event)) %>%
                  fill(Event)

Upvotes: 1

lmo
lmo

Reputation: 38500

Here is a base R method using rle and inverse.rle:

# create run length encoding
temp <- rle(df$event)
# fill in missing values with previous values
temp$values[is.na(temp$values)] <- temp$values[which(is.na(temp$values))-1]

# fill in vector of data.frame
df$event <- inverse.rle(temp)

Note that this won't work if there is a missing value in the first position of the variable.

Upvotes: 0

Alex Ioannides
Alex Ioannides

Reputation: 1222

Here's a toy example of how the purrr package could be used to solve your problem, assuming the data is in a data.frame and the missing values are NA:

library(purrr)

df <- data.frame("event" = c(1, NA, 2, NA, 3, NA, 5), "other" = 1:7)

df
#     event other
# 1     1     1
# 2    NA     2
# 3     2     3
# 4    NA     4
# 5     3     5
# 6    NA     6
# 7     5     7


df$event <- accumulate(.x = df$event, .f = function(x, y) { if(is.na(y)) x else y })

df
#     event other
# 1     1     1
# 2     1     2
# 3     2     3
# 4     2     4
# 5     3     5
# 6     3     6
# 7     5     7

Upvotes: 1

Related Questions