Reputation: 11
I have a dataset in R similar to the dummy data shown below:
Apple-3
Apple-California-4
Apple-China-3
Samsung-2
Samsung-India-2
Sony-AG-1
Sony-4
Sony-USA-4
I need to combine them based on a similarity score, so that the result is:
Apple-10
Samsung-4
Sony-9
For example, Apple, Apple-China and Apple-California get combined into Apple, and their values get summed up.
Is there a way to do that?
Upvotes: 1
Views: 65
Reputation: 51582
Here is another way of doing it, using gsub() and aggregate(). Note that I converted the column from factor to character beforehand.
d$names <- gsub("-.*", "", d$V1)
d$values <- as.numeric(gsub("[^\\d]", "", d$V1, perl = TRUE))
aggregate(values ~ names, d, sum)
# names values
#1 Apple 10
#2 Samsung 4
#3 Sony 9
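For anyone starting from the raw strings, here is a minimal self-contained sketch of the same idea. It assumes the data sits in a single column called V1 (as in the DATA section) and, as a small variation on the code above, uses sub(".*-", "", ...) to take whatever follows the last dash, which would also tolerate digits inside a company name.

```r
# Rebuild the sample as a factor column, then convert it as the answer describes
d <- data.frame(V1 = factor(c("Apple-3", "Apple-California-4", "Apple-China-3",
                              "Samsung-2", "Samsung-India-2", "Sony-AG-1",
                              "Sony-4", "Sony-USA-4")))
d$V1 <- as.character(d$V1)                     # factor -> character

d$names  <- sub("-.*", "", d$V1)               # text before the first dash
d$values <- as.numeric(sub(".*-", "", d$V1))   # text after the last dash

# Sum the values within each company name
res <- aggregate(values ~ names, d, sum)
print(res)
```

On the sample data this should give the same three totals as above (Apple 10, Samsung 4, Sony 9).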
DATA
dput(d)
structure(list(V1 = c("Apple-3", "Apple-California-4", "Apple-China-3",
"Samsung-2", "Samsung-India-2", "Sony-AG-1", "Sony-4", "Sony-USA-4"
), names = c("Apple", "Apple", "Apple", "Samsung", "Samsung",
"Sony", "Sony", "Sony"), values = c(3, 4, 3, 2, 2, 1, 4, 4)), .Names = c("V1",
"names", "values"), row.names = c(NA, -8L), class = "data.frame")
Upvotes: 1
Reputation: 561
This should really be a string manipulation exercise but I thought this could be a FUN challenge without using string functions.
So I saved your sample as a CSV file, then used the dash (-) as the field separator when reading it into a data frame.
df <- read.csv('Manufacturers.csv', header = FALSE, sep = '-')
This creates a data frame with 3 columns
V1 V2 V3
1 Apple 3 NA
2 Apple California 4
3 Apple China 3
4 Samsung 2 NA
5 Samsung India 2
6 Sony AG 1
7 Sony 4 NA
8 Sony USA 4
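Since V2 is a factor (or a character vector in R 4.0 and later, where strings are no longer read as factors by default), convert it to numbers. The non-numeric entries become NA, with a coercion warning.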
df$V2 <- as.numeric(as.character(df$V2))
At this point, V2 and V3 are a bunch of numbers with NAs. Let's convert those NAs to zeros.
df$V2[is.na(df$V2)] <- 0
df$V3[is.na(df$V3)] <- 0
Add V2 and V3 together to a new column. I called mine Quantity.
df$Quantity <- df$V2 + df$V3
Then sum the Quantity column.
aggregate(df$Quantity, by=list(Category=df$V1), FUN=sum)
And this is what I got:
Category x
1 Apple 10
2 Samsung 4
3 Sony 9
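The steps above can also be sketched end to end without a file on disk. This is only an illustration under a few assumptions: the sample is passed through the text argument of read.csv instead of Manufacturers.csv, the column names are set explicitly, and rowSums(..., na.rm = TRUE) is swapped in for replacing the NAs with zeros by hand.

```r
raw <- c("Apple-3", "Apple-California-4", "Apple-China-3", "Samsung-2",
         "Samsung-India-2", "Sony-AG-1", "Sony-4", "Sony-USA-4")

# col.names forces three columns; fill = TRUE pads the short rows with NA
df <- read.csv(text = paste(raw, collapse = "\n"), header = FALSE, sep = "-",
               col.names = c("V1", "V2", "V3"), fill = TRUE,
               stringsAsFactors = FALSE)

# Region names in V2 cannot be parsed as numbers and become NA
v2 <- suppressWarnings(as.numeric(df$V2))

# Add V2 and V3 row by row, ignoring the NAs
df$Quantity <- rowSums(cbind(v2, df$V3), na.rm = TRUE)

res <- aggregate(Quantity ~ V1, data = df, FUN = sum)
print(res)
```

Using the formula interface here keeps the original column names (V1, Quantity) in the result instead of Category and x.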
Happy coding!
-bg
Upvotes: 1
Reputation: 830
You should separate the character bit from the score first:
# two columns: one with the company ID and one with the score
company <- as.matrix(c("Apple", "Apple-California", "Apple-China", "Samsung"))
score <- as.matrix(c(3, 4, 3, 2))
# bind the columns into one matrix (cbind coerces the scores to character here)
data <- cbind(company, score)
# this will return which rows contain the word "Apple"
n <- grep("Apple", data[,1])
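As a rough sketch of how those row indices could then be used, the snippet below sums the scores of the matching rows. It keeps score as a separate numeric vector (an assumption made here so the values stay numeric), and apple_total is just an illustrative name.

```r
company <- c("Apple", "Apple-California", "Apple-China", "Samsung")
score   <- c(3, 4, 3, 2)

# Indices of the entries whose name contains "Apple"
n <- grep("Apple", company)

# Sum the scores at those positions
apple_total <- sum(score[n])
print(apple_total)
```

For the four sample rows, n is 1 2 3 and apple_total is 10.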
It is also useful to know how to subset a character vector in order to get rid of the extra bits: look at the strsplit(), paste() and paste0() functions.
The first will help you split the text into pieces at a delimiter (or into individual characters). The latter two will help you paste things back together.
Another easy one to use is substr("HEllo", 1, 4), which will output characters 1 to 4 -> "HEll"
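A short sketch of how these functions might be combined on strings shaped like the ones in the question. The variable names (x, parts, label, totals) are made up for illustration, and using tapply() for the final sum is one option among several.

```r
x <- c("Apple-3", "Apple-California-4", "Samsung-India-2")

# Split each string at the dashes; the result is a list of character vectors
parts <- strsplit(x, "-", fixed = TRUE)

# First piece is the company, last piece is the score
company <- sapply(parts, function(p) p[1])
score   <- as.numeric(sapply(parts, function(p) p[length(p)]))

# paste0() glues pieces back together without a separator
label <- paste0(company, ": ", score)

# substr() extracts a fixed range of characters
first_five <- substr("Apple-China", 1, 5)

# Sum the scores per company
totals <- tapply(score, company, sum)
print(totals)
```

Here company is "Apple" "Apple" "Samsung", first_five is "Apple", and the totals are 7 for Apple and 2 for Samsung.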
Upvotes: 0