masber

Reputation: 3067

Merge rows from same dataframe

I was wondering whether it is possible to merge different rows of a data frame if they have one field in common:

input:

df = rbind(c("01/01/2016",01:02:30,"100","character(0)","file A"),
           c("02/01/2016",9:02:30,"character(0)", 3, "file A"),
           c("02/01/2016",8:30:30,"200","character(0)","file B"),
           c("03/01/2016",8:25:30,"50","character(0)","file C"),
           c("04/01/2016",17:20:30,"character(0)","600","file B"))

output:

df = rbind(c(01/01/2016,01:02:30,"100",3,"file A"),
           c(02/01/2016,8:30:30,"200",600,"file B"),
           c(03/01/2016,8:25:30,"50","character(0)","file C"))

So as you can see, we merge the rows according to the last value (file A, file B or file C). I need to keep the earliest of the dates: e.g. for "file A" we have two dates, 01/01/2016 and 02/01/2016, and we want to keep 01/01/2016. We won't merge more than 2 rows per value.

We want to keep the earliest date

Upvotes: 0

Views: 692

Answers (1)

niczky12

Reputation: 5063

Based on your comments you want to find the first instance (ordered by one column) of a non-missing value for each column based on a grouping column (in your case the "file A/B/C" column).

First you will have to clean your data a bit. Your data load step is buggy because the timestamps are missing their quotation marks. Also, I'm assuming the character(0) values are meant to represent missing values. If so, use NA instead. Here is the data initialisation and cleaning step:

# prepare your data
df = data.frame(V1 = c("01/01/2016 01:02:30","02/01/2016 9:02:30","02/01/2016 8:30:30",
                       "03/01/2016 8:25:30","04/01/2016 17:20:30"),
                V2 = c("100","character(0)","200","50","character(0)"),
                V3 = c("character(0)", "3", "character(0)","character(0)", "600"),
                V4 = c("file A", "file A", "file B", "file C", "file B"))

# replace the character(0)s with NAs as they are missing values
df[df == "character(0)"] <- NA

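# convert character dates to date-times (POSIXct is stored as a plain vector,
# which is easier to sort and subset inside a data frame than strptime()'s POSIXlt)
df$V1 <- as.POSIXct(as.character(df[ ,1]), format = "%d/%m/%Y %H:%M:%S")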

I've named the columns V1 to V4, but you probably want more descriptive names. To get what you need, you'll have to fill in the missing values in each of the columns using the na.locf() function from the zoo package. To avoid cross-contamination of data between different values of the V4 column, I'm looping through the groups. (There might be a better solution to this...) Here is a function to do your custom row merging:

library(zoo)  # provides na.locf()

custom_row_merge <- function(df,
                             sort_by,
                             group_by){

    # sort by group, then by date in increasing order (earliest first)
    df <- df[order(df[,group_by], df[,sort_by]), ]

    # select the columns to merge
    columns_to_merge <- names(df)[!(names(df) %in% c(sort_by, group_by))]

    # fill data for each unique value of group by column
    for (file_type in unique(df[, group_by])){

        row_indices <- (df[,group_by] == file_type)

        # fill missing values for each column that is not group by or sort by;
        # fromLast = TRUE copies the next non-missing value backwards
        for (column_name in columns_to_merge){

            df[row_indices, column_name] <- na.locf(df[row_indices, column_name],
                                                    na.rm = FALSE,
                                                    fromLast = TRUE)
        }

    }

    # get first occurrence of each file, now with the filled values
    return(df[!duplicated(df[, group_by]), ])

}
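
As a side note on the fill step inside the loop, here is a minimal sketch of what na.locf() does with fromLast = TRUE on a single column, assuming the zoo package is installed. The two vectors are just the V2 and V3 values of "file A" in date order, typed in by hand for illustration:

```r
library(zoo)  # provides na.locf()

# "file A" values, ordered by date (earliest first)
v2_file_a <- c("100", NA)
v3_file_a <- c(NA, "3")

# fromLast = TRUE carries the next non-missing value backwards;
# na.rm = FALSE keeps the vector length, so trailing NAs stay NA
filled_v2 <- na.locf(v2_file_a, na.rm = FALSE, fromLast = TRUE)
filled_v3 <- na.locf(v3_file_a, na.rm = FALSE, fromLast = TRUE)

print(filled_v2)  # "100" NA
print(filled_v3)  # "3" "3"
```

After the fill, the first (earliest) row of each group holds the first non-missing value of every column, which is why keeping only that row is enough.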

Here is the original data frame:

> df
                   V1   V2   V3     V4
1 2016-01-01 01:02:30  100 <NA> file A
2 2016-01-02 09:02:30 <NA>    3 file A
3 2016-01-02 08:30:30  200 <NA> file B
4 2016-01-03 08:25:30   50 <NA> file C
5 2016-01-04 17:20:30 <NA>  600 file B

And the one produced by the function, matching what you describe in your question:

> custom_row_merge(df, "V1", "V4")
                   V1  V2   V3     V4
1 2016-01-01 01:02:30 100    3 file A
3 2016-01-02 08:30:30 200  600 file B
4 2016-01-03 08:25:30  50 <NA> file C
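
If you would rather avoid the explicit loop over row indices, the same idea can probably be written in base R with split() and lapply(), without zoo. This is only a sketch of an alternative, not the function above: merge_rows_base() and first_non_na() are hypothetical names, and the data is rebuilt here (already cleaned, with the dates stored as POSIXct in UTC) so that the block runs on its own:

```r
# Cleaned example data, same values as above
df <- data.frame(
  V1 = as.POSIXct(c("01/01/2016 01:02:30", "02/01/2016 9:02:30",
                    "02/01/2016 8:30:30", "03/01/2016 8:25:30",
                    "04/01/2016 17:20:30"),
                  format = "%d/%m/%Y %H:%M:%S", tz = "UTC"),
  V2 = c("100", NA, "200", "50", NA),
  V3 = c(NA, "3", NA, NA, "600"),
  V4 = c("file A", "file A", "file B", "file C", "file B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing element, or NA if all are missing
first_non_na <- function(x) {
  x[!is.na(x)][1]
}

# Hypothetical alternative: one row per group, earliest date kept,
# every other column set to its first non-missing value within the group
merge_rows_base <- function(df, sort_by, group_by) {
  df <- df[order(df[[group_by]], df[[sort_by]]), ]
  value_cols <- setdiff(names(df), c(sort_by, group_by))
  pieces <- lapply(split(df, df[[group_by]]), function(g) {
    out <- g[1, , drop = FALSE]  # earliest row of this group
    for (col in value_cols) {
      out[[col]] <- first_non_na(g[[col]])
    }
    out
  })
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

merged <- merge_rows_base(df, "V1", "V4")
print(merged)
```

On this example it should give the same three rows as custom_row_merge(), apart from the row names.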

You can of course replace the missing values with character(0) again if you want to.
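
A minimal sketch of that last step, assuming the merged result is stored in a variable (called merged here, a name made up for this example) and that V2 and V3 are character columns rather than factors. The data frame below is a hand-typed stand-in for the merged output, without the date column:

```r
# Stand-in for the merged result shown above
merged <- data.frame(
  V2 = c("100", "200", "50"),
  V3 = c("3", "600", NA),
  V4 = c("file A", "file B", "file C"),
  stringsAsFactors = FALSE
)

# Only touch the value columns, so dates and group labels are left alone
value_cols <- c("V2", "V3")
merged[value_cols][is.na(merged[value_cols])] <- "character(0)"

print(merged)
```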

Upvotes: 2
