Reputation: 43
I have merged two data frames using bind_rows. I have a situation where I have two rows of data as for example below:
Page Path Page Title Byline Pageviews
/facilities/when-lighting-strikes NA NA 668
/facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
When I have these type of duplicate page paths I'd like to merge the identical page paths, eliminate the two NA's in the first row keeping the page title (When Lighting Strikes) and Byline (Tom Jones) and then keep the pageviews result of 668 from the first row. Somehow it seems that I need
Is there a way I can do this in R dplyr? Or is there a better way?
Upvotes: 3
Views: 2608
Reputation: 5206
An alternative approach using fill. Using tidyverse
1.3.0+ with dplyr
0.8.5+, you can use fill to fill in missing values.
See this for more information https://tidyr.tidyverse.org/reference/fill.html
DATA Thanks Alistaire
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes NA NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
CODE
I just did this for PageTitle but you can repeat fill to do it for other columns. (dplyr
gurus might have a smarter way to do all 3 columns at once). If you have ordered data like dates, then you can set .direction
to be just down for example (look at past data).
df.new <- df %>% group_by(PagePath)
%>% fill(PageTitle, .direction = "updown")
which gives you
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
Once you have all the NAs cleaned up then you can use distinct or rank to get your final summarised dataframe.
Upvotes: 0
Reputation: 887911
Here is an option using data.table
and complete.cases
. We convert the 'data.frame' to 'data.table' (setDT(df)
), grouped by 'PathPath', loop through the columns of the dataset (lapply(.SD, ..
) and remove the NA elements with complete.cases
. The complete.cases
returns a logical vector
and can be used for subsetting. According to this, complete.cases
usage is much more faster than na.omit
and coupled with data.table
it would increase the efficiency.
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
# PagePath PageTitle Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
df <- structure(list(PagePath = structure(c(1L, 1L),
.Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
Upvotes: 0
Reputation: 145
Another way to do this (similar to a previous solutions that uses dplyr) would be:
df %>% group_by(PagePath) %>%
dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
Byline = paste(na.omit(Byline)),
Pageviews =paste(na.omit(Pageviews)))
Upvotes: 0
Reputation: 2359
Use replace function in for loop
for(i in unique(df$Page_Path)){
df$Pageviews[df$Page_Path==i] <- replace(df$Pageviews[df$Page_Path==i],is.na(df$Pageviews[df$Page_Path==i]),
df$Pageviews[!is.na(df$Pageviews[df$Page_Path==i])])
}
df <- subset(df, !is.na(Page_Title))
print(df)
Page_Path Page_Title Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
Upvotes: 2
Reputation: 43354
A simple solution:
library(dplyr)
df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
#
# PagePath PageTitle Byline Pageviews
# (fctr) (fctr) (fctr) (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
If your data is more complicated, you may need a more robust approach.
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
Upvotes: 3