Reputation: 47
I would like to obtain a data frame that contains two columns: 1. The different fruits (no duplicates) 2. The first date of the appearance of the specific fruit (i.e. kiwis)
fruits <- c("apples, oranges, pears, bananas",
"pineapples, mangos, guavas",
"bananas, apples, kiwis")
fruits<-as.data.frame(fruits)
fruits$date<-c( "12.8.16", "22.4.17", "12.9.16")
fruits[with(fruits, order(date)), ]
I tried to write a loop or use the match command. However, the unique string values are not recognized.
Thank you in advance! Jannis
Upvotes: 3
Views: 2243
Reputation: 269634
Here are some solutions:
1) strsplit/unnest/summarize This uses dplyr and tidyr. First convert date
column to "Date"
class and split the fruits
column producing a column where each cell contains a vector of fruits. unnest
that and find the minimums:
library(dplyr)
library(tidyr)
fruits %>%
mutate(date = as.Date(date, "%d.%m.%y"),
fruits = strsplit(as.character(fruits), ", ")) %>%
unnest %>%
group_by(fruits) %>%
summarize(date = min(date)) %>%
ungroup
giving:
# A tibble: 8 × 2
fruits date
<chr> <date>
1 apples 2016-08-12
2 bananas 2016-08-12
3 guavas 2017-04-22
4 kiwis 2016-09-12
5 mangos 2017-04-22
6 oranges 2016-08-12
7 pears 2016-08-12
8 pineapples 2017-04-22
1a) separate_rows/summarize This slightly shorter variation uses separate_rows
(replacing the strsplit
and unnest
lines with a single simpler command). It requires tidyr 0.5 or later. It gives the same result:
fruits %>%
mutate(date = as.Date(date, "%d.%m.%y")) %>%
separate_rows(fruits) %>%
group_by(fruits) %>%
summarize(date = min(date)) %>%
ungroup
2) strsplit/stack/aggregate This does not use any packages. First we split the fruits column and name the components of the resulting list, L
, with dates. Then we stack the list creating a data frame and rename the columns while also creating a true "Date"
class column. Finally we aggregate
to find the minimums.
L <- with(fruits, setNames(strsplit(as.character(fruits), ", "), as.Date(date,"%d.%m.%y")))
stk <- with(stack(L), data.frame(fruits = values, date = as.Date(ind)))
aggregate(date ~ fruits, stk, min)
giving this data.frame:
fruits date
1 apples 2016-08-12
2 bananas 2016-08-12
3 guavas 2017-04-22
4 kiwis 2016-09-12
5 mangos 2017-04-22
6 oranges 2016-08-12
7 pears 2016-08-12
8 pineapples 2017-04-22
Upvotes: 3
Reputation: 18425
I think this is what you are asking for.
fruits <- c("apples, oranges, pears, bananas",
"pineapples, mangos, guavas",
"bananas, apples, kiwis")
fruits<-as.data.frame(fruits,stringsAsFactors=FALSE) #probably easier for the fruits to be strings rather than factors
fruits$date<-as.Date(c( "12.8.16", "22.4.17", "12.9.16"),format="%d.%m.%y") #and set your dates to be Dates rather than strings (otherwise they will be sorted alphabetically)
fruits[with(fruits, order(date)), ]
#need to convert your df to one-fruit-per-row
fruits2 <- do.call(rbind, #this binds together the data frames created by the lapply loop
lapply(1:nrow(fruits), #loops through the rows of fruits df to create a list of data frames, each corresponding to one row
function(i) data.frame(fruit=trimws(strsplit((fruits$fruits),",")[[i]]), #splits your strings at commas, and trims off the whitespace
date=fruits$date[i],stringsAsFactors = FALSE))) #adds the date corresponding to each row
#finding the first appearance is easily done using dplyr
library(dplyr)
fruits3 <- fruits2 %>% group_by(fruit) %>% summarise(firstdate=min(date))
Or another approach would be to set up a data frame using the unique names of the fruits, and then find the first date using a grep
lookup for each fruit...
fruits <- fruits[order(fruits$date),]
firstfruits <- data.frame(fruit=unique(trimws(unlist(strsplit(fruits$fruits,",")))),stringsAsFactors = FALSE)
firstfruits$date <- do.call(c,lapply(firstfruits$fruit, function(F) fruits$date[grep(F,fruits$fruits)[1]]))
Upvotes: 0
Reputation: 99331
Here's an approach using the splitstackshape package, which uses the data.table package underneath. We can use cSplit()
to split the fruits
column at the comma, then use data.table syntax to take the minimum date
.
library(splitstackshape)
## create the long data frame from the split 'fruits' column
DT <- cSplit(fruits, "fruits", sep = ",", direction = "long")
## convert the 'date' column to date class and take the minimum row
DT[, .(date = min(as.IDate(date, "%d.%m.%y"))), by = fruits]
# fruits date
# 1: apples 2016-08-12
# 2: oranges 2016-08-12
# 3: pears 2016-08-12
# 4: bananas 2016-08-12
# 5: pineapples 2017-04-22
# 6: mangos 2017-04-22
# 7: guavas 2017-04-22
# 8: kiwis 2016-09-12
Upvotes: 1