Reputation: 43

Merging two rows of data in R based on rules

I have merged two data frames using bind_rows. I have a situation where I have two rows of data as for example below:

Page Path                           Page Title             Byline      Pageviews 
/facilities/when-lighting-strikes      NA                    NA           668
/facilities/when-lighting-strikes   When Lighting Strikes  Tom Jones       NA

When I have these type of duplicate page paths I'd like to merge the identical page paths, eliminate the two NA's in the first row keeping the page title (When Lighting Strikes) and Byline (Tom Jones) and then keep the pageviews result of 668 from the first row. Somehow it seems that I need

to identify the duplicate pages paths
look to see if there are different titles and bylines; remove NAs
keep the row with the pageview result; remove the NA row

Is there a way I can do this in R dplyr? Or is there a better way?

Upvotes: 3

Answers (5)

micstr

Reputation: 5206

An alternative approach using fill. Using tidyverse 1.3.0+ with dplyr 0.8.5+, you can use fill to fill in missing values.

See this for more information https://tidyr.tidyverse.org/reference/fill.html

DATA Thanks Alistaire

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes NA                    NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

CODE

I just did this for PageTitle but you can repeat fill to do it for other columns. (dplyr gurus might have a smarter way to do all 3 columns at once). If you have ordered data like dates, then you can set .direction to be just down for example (look at past data).

df.new <- df %>% group_by(PagePath) 
             %>% fill(PageTitle, .direction = "updown")

which gives you

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

Once you have all the NAs cleaned up then you can use distinct or rank to get your final summarised dataframe.

Upvotes: 0

akrun

Reputation: 887911

Here is an option using data.table and complete.cases. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'PathPath', loop through the columns of the dataset (lapply(.SD, ..) and remove the NA elements with complete.cases. The complete.cases returns a logical vector and can be used for subsetting. According to this, complete.cases usage is much more faster than na.omit and coupled with data.table it would increase the efficiency.

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
#                     PagePath             PageTitle    Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

data

df <- structure(list(PagePath = structure(c(1L, 1L), 
 .Label = "/facilities/when-lighting-strikes", class = "factor"),   
    PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
    Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
    Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
-2L))

Upvotes: 0

eminik

Reputation: 145

Another way to do this (similar to a previous solutions that uses dplyr) would be:

  df %>% group_by(PagePath) %>% 
  dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
                   Byline = paste(na.omit(Byline)),
                   Pageviews =paste(na.omit(Pageviews)))

Upvotes: 0

Arun kumar mahesh

Reputation: 2359

Use replace function in for loop

for(i in unique(df$Page_Path)){
  df$Pageviews[df$Page_Path==i] <- replace(df$Pageviews[df$Page_Path==i],is.na(df$Pageviews[df$Page_Path==i]),
    df$Pageviews[!is.na(df$Pageviews[df$Page_Path==i])])
}

df <- subset(df, !is.na(Page_Title))

print(df)

                          Page_Path            Page_Title    Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

Upvotes: 2

alistaire

Reputation: 43354

A simple solution:

library(dplyr)

df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
# 
#                            PagePath             PageTitle    Byline Pageviews
#                              (fctr)                (fctr)    (fctr)     (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

If your data is more complicated, you may need a more robust approach.

Data

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))

Upvotes: 3

Merging two rows of data in R based on rules

Answers (5)

data

Data

Related Questions