Reputation: 11

I need to give a similarity score to the elements in my dataset

I have a dataset in R similar to the dummy data shown below:

Apple-3
Apple-California-4
Apple-China-3
Samsung-2
Samsung-India-2
Sony-AG-1
Sony-4
Sony-USA-4

I need to combine them based on a similarity score, like this:

Apple-10
Samsung-4
Sony-9

e.g.: Apple, Apple-China, Apple-California get combined into Apple and their values get summed up.

Is there a way to do that?

Upvotes: 1

Answers (3)

Sotos

Reputation: 51582

Here is another way of doing it, using gsub and aggregate. Note that I converted V1 from factor to character beforehand.

d$names <- gsub("-.*", "", d$V1)
d$values <- as.numeric(gsub("[^\\d]", "", d$V1, perl = TRUE))
aggregate(values ~ names, d, sum)
#    names values
#1   Apple     10
#2 Samsung      4
#3    Sony      9

DATA

dput(d)
structure(list(V1 = c("Apple-3", "Apple-California-4", "Apple-China-3", 
"Samsung-2", "Samsung-India-2", "Sony-AG-1", "Sony-4", "Sony-USA-4"
), names = c("Apple", "Apple", "Apple", "Samsung", "Samsung", 
"Sony", "Sony", "Sony"), values = c(3, 4, 3, 2, 2, 1, 4, 4)), .Names = c("V1", 
"names", "values"), row.names = c(NA, -8L), class = "data.frame")

Upvotes: 1

BGA

Reputation: 561

This should really be a string manipulation exercise but I thought this could be a FUN challenge without using string functions.

So I saved your sample as a CSV file, then used the dash (-) as the separator when reading it into a data frame.

df <- read.csv('Manufacturers.csv', header = FALSE, sep = '-')

This creates a data frame with 3 columns:

       V1         V2 V3
1   Apple          3 NA
2   Apple California  4
3   Apple      China  3
4 Samsung          2 NA
5 Samsung      India  2
6    Sony         AG  1
7    Sony          4 NA
8    Sony        USA  4

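Since V2 is a factor (or a character column in R 4.0.0 and later, where strings are no longer converted to factors by default), convert it to numbers. The region names become NA, and R emits a coercion warning.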

df$V2 <- as.numeric(as.character(df$V2))

At this point, V2 and V3 are a bunch of numbers with NAs. Let's convert those NAs to zeros.

df$V2[is.na(df$V2)] <- 0
df$V3[is.na(df$V3)] <- 0

Add V2 and V3 together to a new column. I called mine Quantity.

df$Quantity <- df$V2 + df$V3

Then sum the Quantity column.

aggregate(df$Quantity, by=list(Category=df$V1), FUN=sum)

And this is what I got:

  Category  x
1    Apple 10
2  Samsung  4
3     Sony  9
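
For anyone who wants to try these steps without creating a file, here is a rough self-contained sketch. It assumes read.table's text argument as a stand-in for Manufacturers.csv. The helper variables v2 and v3 are just illustrative names.

```r
# The sample lines from the question, inlined so no file is needed
lines <- c("Apple-3", "Apple-California-4", "Apple-China-3", "Samsung-2",
           "Samsung-India-2", "Sony-AG-1", "Sony-4", "Sony-USA-4")

# fill = TRUE pads short rows with NA; col.names fixes the column count at three
df <- read.table(text = lines, sep = "-", header = FALSE, fill = TRUE,
                 col.names = c("V1", "V2", "V3"), stringsAsFactors = FALSE)

# Region names in V2 cannot be parsed as numbers and become NA;
# suppressWarnings() hides the expected coercion warning
v2 <- suppressWarnings(as.numeric(df$V2))
v2[is.na(v2)] <- 0

# Rows that had only two fields have NA in V3
v3 <- df$V3
v3[is.na(v3)] <- 0

# Each row has its number in exactly one of the two columns
df$Quantity <- v2 + v3

res <- aggregate(Quantity ~ V1, df, sum)
res
```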

Happy coding!

-bg

Upvotes: 1

Alex Bădoi

Reputation: 830

You should separate the character bit from the score first:

# two columns: one with the ID and one with the score
company <- c("Apple", "Apple-California", "Apple-China", "Samsung")
score   <- c(3, 4, 3, 2)

# bind the columns into a data frame (cbind() would coerce the scores to character)
data <- data.frame(company, score)

# this returns the indices of the rows that contain the word "Apple"

n <- grep("Apple", data[, 1])

It is also useful to know how to subset a character vector in order to get rid of the extra bits.

Look at the strsplit(), paste() and paste0() functions.

The first splits the text into pieces at a delimiter of your choice (or into individual characters if the delimiter is an empty string). The latter two help you paste things back together.

Another easy one to use is substr("Hello", 1, 4), which returns characters 1 to 4 -> "Hell".
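
To tie these pieces together, here is a small sketch of one way to go from the split names to per-brand totals. It assumes the brand is always the part before the first dash. The names parts, brand and totals are only illustrative.

```r
company <- c("Apple", "Apple-California", "Apple-China", "Samsung")
score   <- c(3, 4, 3, 2)

# strsplit() returns a list with one character vector per input string
parts <- strsplit(company, "-", fixed = TRUE)

# keep only the first piece of each name, i.e. the brand
brand <- vapply(parts, function(p) p[1], character(1))

# sum the scores within each brand; tapply() returns a named vector
totals <- tapply(score, brand, sum)
totals
```

grep("Apple", company) answers "which rows mention Apple?" for one brand at a time, whereas grouping on the first piece handles every brand in a single pass.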

Upvotes: 0
