Reputation: 1
I am attempting to summarize the results of an Excel (.xlsx) file in RStudio. I need to get from this:
Date   Site        URL
Jan 1  Amazon      amazon.com
Jan 1  Blackberry
Jan 2  Google      google.com
Jan 2  G-Mail      google.com
Jan 3  Microsoft   microsoft.com
Jan 1  Ohio
Jan 3  X-Box       microsoft.com
Jan 3  YouTube     youtube.com
Jan 3  Yahoo       yahoo.com
to this:
Date   Site             URL
Jan 1  Amazon           amazon.com
Jan 1  Blackberry
Jan 2  Google|G-Mail    google.com
Jan 3  Microsoft|X-Box  microsoft.com
Jan 1  Ohio
Jan 3  YouTube          youtube.com
Jan 3  Yahoo            yahoo.com
Essentially, I need to summarize all other columns by the "URL" value, but it is crucial that rows with a blank value in the "URL" column are left unaffected by the summarization.
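For reference, the data comes into R roughly like this (just a sketch; I'm assuming the readxl package here, and the file name below is only a placeholder):
library(readxl)

# placeholder file name; blank cells come in as NA by default
URLSummarize <- read_excel("sites.xlsx", sheet = 1)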
What I have now is:
library(dplyr)

URLSummarize <- URLSummarize %>%
  group_by(URL) %>%
  summarize(across(everything(), ~ if (nzchar(URL[1])) paste(unique(.), collapse = "|") else .)) %>%
  ungroup()
dput(URLSummarize)
However, this ends up condensing the rows that are blank in the "URL" column. The data looks like this via dput(URLSummarize):
Date        Site        URL
2022-01-01  Amazon      amazon.com
2022-01-01  Blackberry  NA
2022-01-02  Google      google.com
2022-01-02  G-Mail      google.com
2022-01-03  Microsoft   microsoft.com
2022-01-01  Ohio        NA
2022-01-03  X-Box       microsoft.com
2022-01-03  YouTube     youtube.com
2022-01-03  Yahoo       yahoo.com
but I am still getting this when I write it to an xlsx file:
Date   Site             URL
Jan 1  Amazon           amazon.com
Jan 1  Blackberry|Ohio
Jan 2  Google|G-Mail    google.com
Jan 3  Microsoft|X-Box  microsoft.com
Jan 3  YouTube          youtube.com
Jan 3  Yahoo            yahoo.com
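The write step itself is nothing special; it is roughly this (again a sketch, assuming the writexl package and a placeholder output path):
library(writexl)

# placeholder output path
write_xlsx(URLSummarize, "URLSummarize.xlsx")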
Upvotes: 0
Views: 76
Reputation: 160407
You can use an embedded if. (I've edited this to reference your newer data, and because blanks cannot always be seen in your data, I made them an actual one-space string.)
# identify rows where URL is blank (after trimming whitespace)
isblank <- !nzchar(trimws(dat$URL))
isblank
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

# collapse every column within each URL group, but only over the non-blank rows;
# ave() returns a vector the same length as its input, so it can be assigned back
dat[!isblank, ] <- data.frame(
  lapply(dat, function(z) ave(z[!isblank], dat$URL[!isblank],
                              FUN = function(z) paste(unique(z), collapse = "|")))
)
dat[!duplicated(dat),]
#   Date   Site             URL
# 1 Jan 1  Amazon           amazon.com
# 2 Jan 1  Blackberry
# 3 Jan 2  Google|G-Mail    google.com
# 5 Jan 3  Microsoft|X-Box  microsoft.com
# 6 Jan 1  Ohio
# 8 Jan 3  YouTube          youtube.com
# 9 Jan 3  Yahoo            yahoo.com
(Don't let the row names fool you; there are seven rows.)
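As a quick standalone illustration of what ave() is doing inside that lapply() (a minimal example, not part of the data above):
# ave() splits its first argument by the grouping vector, applies FUN to each
# group, and returns a vector the same length as the input
ave(c("a", "b", "c"), c(1, 1, 2),
    FUN = function(z) paste(unique(z), collapse = "|"))
# [1] "a|b" "a|b" "c"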
library(dplyr)
dat %>%
  group_by(URL) %>%
  summarize(across(everything(), ~ if (nzchar(trimws(URL[1]))) paste(unique(.), collapse = "|") else .)) %>%
  ungroup()
# # A tibble: 7 x 3
#   URL             Date  Site
#   <chr>           <chr> <chr>
# 1 " "             Jan 1 Blackberry
# 2 " "             Jan 1 Ohio
# 3 "amazon.com"    Jan 1 Amazon
# 4 "google.com"    Jan 2 Google|G-Mail
# 5 "microsoft.com" Jan 3 Microsoft|X-Box
# 6 "yahoo.com"     Jan 3 Yahoo
# 7 "youtube.com"   Jan 3 YouTube
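Note that for the blank-URL group the else branch returns the full vectors, so summarize() emits more than one row for that group; dplyr 1.1.0+ warns about this and suggests reframe() instead. A sketch of the same idea with reframe() (assuming dplyr >= 1.1.0):
library(dplyr)
dat %>%
  group_by(URL) %>%
  reframe(across(everything(), ~ if (nzchar(trimws(URL[1]))) paste(unique(.), collapse = "|") else .))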
Or if it's just one column, you can simplify it a little:
dat %>%
  group_by(URL) %>%
  summarize(Site = if (nzchar(trimws(URL[1]))) paste(unique(Site), collapse = "|") else Site) %>%
  ungroup()
library(data.table)
DT <- as.data.table(dat) # setDT(dat) is generally preferred
DT[, lapply(.SD, function(z) if (nzchar(trimws(URL[1]))) paste(unique(z), collapse = "|") else z),
   by = .(URL)]
#              URL   Date            Site
#           <char> <char>          <char>
# 1:    amazon.com  Jan 1          Amazon
# 2:                Jan 1      Blackberry
# 3:                Jan 1            Ohio
# 4:    google.com  Jan 2   Google|G-Mail
# 5: microsoft.com  Jan 3 Microsoft|X-Box
# 6:   youtube.com  Jan 3         YouTube
# 7:     yahoo.com  Jan 3           Yahoo
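All of the grouped approaches move URL to the first column; if the xlsx layout matters, the original Date/Site/URL order can be restored afterwards (a small sketch, where res is a hypothetical name for whichever summarized result you keep):
# 'res' is a hypothetical name for the summarized result
res <- res[, c("Date", "Site", "URL")]          # base R

# or with dplyr
res <- dplyr::relocate(res, Date, Site, URL)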
Data
dat <- structure(list(
  Date = c("Jan 1", "Jan 1", "Jan 2", "Jan 2", "Jan 3", "Jan 1", "Jan 3", "Jan 3", "Jan 3"),
  Site = c("Amazon", "Blackberry", "Google", "G-Mail", "Microsoft", "Ohio", "X-Box", "YouTube", "Yahoo"),
  URL = c("amazon.com", " ", "google.com", "google.com", "microsoft.com", " ", "microsoft.com", "youtube.com", "yahoo.com")
), row.names = c(NA, -9L), class = "data.frame")
Upvotes: 1
Reputation: 56004
Add unique dummy "missing" values, then group-by-paste using aggregate and toString, then convert the dummy values back into blanks "":
# add a unique dummy "missing" URL
ix <- which(dat$URL == "")
dat[ ix, "URL" ] <- paste0("missing", ix)
# group by paste
res <- aggregate(Site ~ URL, dat, toString)
# convert missing to blanks
res[ grepl("missing", res$URL), "URL" ] <- ""
res
#             URL             Site
# 1    amazon.com           Amazon
# 2    google.com   Google, G-Mail
# 3 microsoft.com Microsoft, X-Box
# 4                     Blackberry
# 5                           Ohio
# 6     yahoo.com            Yahoo
# 7   youtube.com          YouTube
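If the Date column needs to be kept as well (the code above only keeps Site), a hedged variant aggregates both columns at once after the same dummy-missing trick, collapsing unique values with "|" to match the desired output:
res <- aggregate(cbind(Date, Site) ~ URL, dat,
                 FUN = function(z) paste(unique(z), collapse = "|"))
# convert the dummy values back to blanks
res[grepl("missing", res$URL), "URL"] <- ""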
Upvotes: 1