Reputation: 3067
I was wondering whether it is possible to merge different rows of a data frame when they have one field in common:
input:
df = rbind(c("01/01/2016",01:02:30,"100","character(0)","file A"),
c("02/01/2016",9:02:30,"character(0)", 3, "file A"),
c("02/01/2016",8:30:30,"200","character(0)","file B"),
c("03/01/2016",8:25:30,"50","character(0)","file C"),
c("04/01/2016",17:20:30,"character(0)","600","file B"))
output:
df = rbind(c(01/01/2016,01:02:30,"100",3,"file A"),
c(02/01/2016,8:30:30,"200",600,"file B"),
c(03/01/2016,8:25:30,"50","character(0)","file C"))
So as you can see, we merge the rows according to the last value (file A, file B or file C). I need to keep the earliest of the dates, e.g. for "file A" we have 2 dates, 01/01/2016 and 02/01/2016, and we want to keep 01/01/2016. We won't merge more than 2 rows per value.
Upvotes: 0
Views: 692
Reputation: 5063
Based on your comments you want to find the first instance (ordered by one column) of a non-missing value for each column based on a grouping column (in your case the "file A/B/C" column).
First you will have to clean your data a bit. Your data load step is buggy due to some misplaced quotation marks around the timestamps. Also, I'm assuming you want to represent missing values with the character(0) values. If so, use NAs. Here is the data initialisation and cleaning step:
# prepare your data
df = data.frame(V1 = c("01/01/2016 01:02:30", "02/01/2016 9:02:30", "02/01/2016 8:30:30",
                       "03/01/2016 8:25:30", "04/01/2016 17:20:30"),
                V2 = c("100", "character(0)", "200", "50", "character(0)"),
                V3 = c("character(0)", "3", "character(0)", "character(0)", "600"),
                V4 = c("file A", "file A", "file B", "file C", "file B"))
# replace the character(0)s with NAs as they are missing values
df[df == "character(0)"] <- NA
# convert character dates to time (POSIXct is safer than POSIXlt inside a data frame)
df$V1 <- as.POSIXct(strptime(as.character(df[ ,1]), format = "%d/%m/%Y %H:%M:%S"))
I've named the columns V1..4, but you probably want some more descriptive names. To get what you need, you'll need to fill in the missing values for each of the columns using the zoo package's na.locf() function. To avoid cross-contamination of data across different values of the V4 column, I'm looping through the data group by group (there might be a better solution to this...).
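To see why filling backwards (fromLast = TRUE) is the useful direction here, a small sketch on plain vectors may help. The two vectors are made up for illustration and stand in for the V2 and V3 values of a single group, already sorted by date; it assumes the zoo package is installed.

```r
library(zoo)

# Values of one group, earliest row first (illustrative data only).
v2 <- c("100", NA)   # value present only in the earliest row
v3 <- c(NA, "3")     # value present only in the later row

# fromLast = TRUE carries later observations backwards, so the first
# (earliest) row ends up holding the first non-missing value.
# na.rm = FALSE keeps the vector at its original length.
filled_v2 <- na.locf(v2, na.rm = FALSE, fromLast = TRUE)
filled_v3 <- na.locf(v3, na.rm = FALSE, fromLast = TRUE)

print(filled_v2)
print(filled_v3)
```

After the fill, the first element of each vector holds the group's value whenever one exists, which is why keeping only the first row per group is enough afterwards.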
Here is a function to do your custom row merging:
library(zoo)  # provides na.locf()

custom_row_merge <- function(df,
                             sort_by,
                             group_by){
  # sort by group, then by date in increasing order (earliest first)
  df <- df[order(df[,group_by], df[,sort_by]), ]
  # select the columns to merge
  columns_to_merge <- names(df)[!(names(df) %in% c(sort_by, group_by))]
  # fill data for each unique value of group by column
  for (file_type in unique(df[, group_by])){
    row_indices <- (df[,group_by] == file_type)
    # fill missing values for each column that is not group by or sort by
    for (column_name in columns_to_merge){
      # fromLast = TRUE pulls later values back into the earliest row
      df[row_indices, column_name] <- na.locf(df[row_indices, column_name],
                                              na.rm = FALSE,
                                              fromLast = TRUE)
    }
  }
  # get first occurrence of each file, now with the filled values
  return(df[!duplicated(df[, group_by]), ])
}
Here is the original data frame:
> df
                   V1   V2   V3     V4
1 2016-01-01 01:02:30  100 <NA> file A
2 2016-01-02 09:02:30 <NA>    3 file A
3 2016-01-02 08:30:30  200 <NA> file B
4 2016-01-03 08:25:30   50 <NA> file C
5 2016-01-04 17:20:30 <NA>  600 file B
And the one produced by the function, matching what you describe in your question:
> custom_row_merge(df, "V1", "V4")
                   V1  V2   V3     V4
1 2016-01-01 01:02:30 100    3 file A
3 2016-01-02 08:30:30 200  600 file B
4 2016-01-03 08:25:30  50 <NA> file C
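If you would rather avoid both the explicit loop and the zoo dependency, one possible alternative is to split the data by group and take the first non-missing value of every column. This is only a sketch in base R, not a drop-in replacement: first_non_na is a hypothetical helper introduced here, and the sample data is rebuilt inline (with NA already in place) so the block runs on its own.

```r
# Hypothetical helper: first non-missing element of a vector, or a
# typed NA if every element is missing.
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0) keep[1] else x[1]
}

# Same sample data as in the answer, with NA instead of "character(0)".
df <- data.frame(
  V1 = as.POSIXct(c("01/01/2016 01:02:30", "02/01/2016 9:02:30",
                    "02/01/2016 8:30:30", "03/01/2016 8:25:30",
                    "04/01/2016 17:20:30"),
                  format = "%d/%m/%Y %H:%M:%S", tz = "UTC"),
  V2 = c("100", NA, "200", "50", NA),
  V3 = c(NA, "3", NA, NA, "600"),
  V4 = c("file A", "file A", "file B", "file C", "file B"),
  stringsAsFactors = FALSE
)

# Sort by date so that "first" means "earliest", then collapse each
# group to a single row.
df <- df[order(df$V1), ]
merged <- do.call(rbind, lapply(split(df, df$V4), function(g) {
  as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
}))
rownames(merged) <- NULL
print(merged)
```

On this sample it should yield one row per file with the earliest date, the same values as the function above produces. For groups with more than two rows it picks the earliest non-missing value per column, which is also what the backwards fill does.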
You can of course fill the remaining missing values with character(0) values again if you want to.
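A minimal sketch of that last step, assuming the value columns are character (not factor) columns. The small merged data frame below is a made-up stand-in for the merged result, with the date column left out for simplicity.

```r
# Illustrative stand-in for the merged result (character columns only).
merged <- data.frame(V2 = c("100", "200", "50"),
                     V3 = c("3", "600", NA),
                     V4 = c("file A", "file B", "file C"),
                     stringsAsFactors = FALSE)

# Put the original placeholder back wherever a value is still missing.
merged[is.na(merged)] <- "character(0)"
print(merged)
```

If the columns are factors (the default for data.frame() before R 4.0), convert them with as.character() first, since assigning a level that does not exist yet would only produce NA with a warning.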
Upvotes: 2