Reputation: 85
I have a data frame like below with 96,000 lines and varying amounts of the same ID. I want select the rows with the same ID and then choose the Sales amount from the row with the earliest date.
ID Date1 Date2 Date3 Sales
------ ---------- ---------- ---------- -------
3351 7/18/18 1/8/2017 9/7/2016 $240
3351 9/15/14 $670
3351 4/5/2017 9/7/16
8222 6/6/2013 2/5/2008 $943
In this case for ID 3351 we would pick $670, because Date2 in the second dup is in 2014.
for(i in length(data))
{
if(duplicated(dat17[i,1]) == TRUE)
{
pmin(dat17[1,7:9], dat17[2,7:9])
}
}
I am trying to use a for loop to find the duplicates and compare, but I'm not sure how to use multiple rows given by the duplicated() function. pmin() only works for vectors and I need to use the data from the data frame.
Upvotes: 0
Views: 474
Reputation: 886968
We don't need to compare a logical expression to TRUE/FALSE. it is already a logical output in base R
(No external packages used)
i1 <- !duplicated(dat17[[1]])
Also, before doing the comparison, make sure the 'Date' columns are converted to Date
class
dat17[2:4] <- lapply(dat17[2:4], as.Date, format = "%m/%d/%y")
cbind(dat17[i1,], new = do.call(pmin, c(dat17[2:4][i1,], na.rm = TRUE)))
# ID Date1 Date2 Date3 Sales new
#1 3351 2018-07-18 2020-01-08 2020-09-07 $240 2018-07-18
#4 8222 <NA> 2020-06-06 2020-02-05 $943 2020-02-05
Note that the for
loop is just looping through one value length(data)
returns a single element. Also, not sure what is 'data' here when the next line should 'dat17'
dat17 <- structure(list(ID = c(3351L, 3351L, 3351L, 8222L), Date1 = c("7/18/18",
NA, "4/5/2017", NA), Date2 = c("1/8/2017", "9/15/14", "9/7/16",
"6/6/2013"), Date3 = c("9/7/2016", NA, NA, "2/5/2008"), Sales = c("$240",
"$670", NA, "$943")), class = "data.frame", row.names = c(NA,
-4L))
Upvotes: 1
Reputation: 1263
Here's one option:
library(tidyverse)
library(lubridate)
df <-
tibble(
id = c(3351, 3351, 8222, 8222),
date1 = c("2018-07-18", NA_character_, "2017-01-03", "2016-03-02"),
date2 = c("2016-03-04", "2015-03-02", NA_character_, "2017-04-05"),
sales = c(240, 670, NA_integer_, 300)
)
df
# A tibble: 4 x 4
id date1 date2 sales
<dbl> <chr> <chr> <dbl>
1 3351 2018-07-18 2016-03-04 240
2 3351 NA 2015-03-02 670
3 8222 2017-01-03 NA NA
4 8222 2016-03-02 2017-04-05 300
The first step is to gather the various date columns into a single column:
df %>%
gather(key = date_col, value = date, date1:date2)
# A tibble: 8 x 4
id sales date_col date
<dbl> <dbl> <chr> <chr>
1 3351 240 date1 2018-07-18
2 3351 670 date1 NA
3 8222 NA date1 2017-01-03
4 8222 300 date1 2016-03-02
5 3351 240 date2 2016-03-04
6 3351 670 date2 2015-03-02
7 8222 NA date2 NA
8 8222 300 date2 2017-04-05
Then you can group by the ID and filter to find the earliest date.
df %>%
gather(key = date_col, value = date, date1:date2) %>%
filter(!is.na(date), !is.na(sales)) %>%
mutate(date = ymd(date)) %>%
group_by(id) %>%
filter(date == min(date)) %>%
ungroup()
# A tibble: 2 x 4
id sales date_col date
<dbl> <dbl> <chr> <date>
1 8222 300 date1 2016-03-02
2 3351 670 date2 2015-03-02
Upvotes: 1
Reputation: 2011
One method is to use the dplyr
package which makes this sort of thing easier.
library(dplyr)
df <- read.table(text = "ID Date1 Date2 Date3 Sales
3351 7/18/18 1/8/2017 9/7/2016 $240
3351 NA 9/15/14 NA $670
3351 4/5/2017 9/7/16 NA NA
8222 NA 6/6/2013 2/5/2008 $943 ",
stringsAsFactors = FALSE,
header = TRUE) %>%
# make sure the date variables are in date format for sorting to work properly
mutate_at(vars(starts_with("Date")),lubridate::mdy)
df %>%
# calculate the minimum date for each row using pmin
mutate(min_date = pmin(Date1,Date2,Date3,na.rm = TRUE)) %>%
# arrange from lowest date to highest date
arrange(min_date) %>%
# for each ID
group_by(ID) %>%
# keep the first one
slice(1) %>%
# ungroup to allow future calculations
ungroup()
# A tibble: 2 x 6
# ID Date1 Date2 Date3 Sales min_date
# <int> <date> <date> <date> <chr> <date>
# 1 3351 NA 2014-09-15 NA $670 2014-09-15
# 2 8222 NA 2013-06-06 2008-02-05 $943 2008-02-05
Upvotes: 4