jcdmb

Reputation: 3056

Matrix transformation and aggregation in R

I am starting development with R and I am still having "beginner problems" with the language. I would like to do the following:

  1. I have a matrix (data frame:=user) with ~900 columns, each of them is the name of a band (Nirvana, Green Day, Daft-Punk, etc.).
  2. Each row holds a user and that user's music taste (Nirvana = 10, Green Day = 5, Daft-Punk = 0).
  3. I would like to query another data frame (:=artists, which holds each artist's music tags) and replace each band name with its genre tag (Nirvana --> Rock, Green Day --> Rock, Daft-Punk --> Techno). There are ~120 tags for music taste (120 < 900).
  4. And finally, I would like to "aggregate" the values over all columns to avoid duplicated columns. In the example from (3) - with the aggregation function "SUM" - the row would have only 2 entries and not 3: (Rock = 15, Techno=0)

Any clues on how to do that with R? Thanks in advance for any help!

Data:

user: pastebin.com/4gVe004T

artists: pastebin.com/dm7weLMG

Upvotes: 1

Views: 351

Answers (1)

MvG

Reputation: 60988

I have a matrix (data frame:=user) with ~900 columns, each of them is the name of a band (Nirvana, Green Day, Daft-Punk, etc.).
Each row holds a user and that user's music taste (Nirvana = 10, Green Day = 5, Daft-Punk = 0)

This is the so-called “wide” format. For most tasks it is better to reshape it to narrow (long) format, i.e. a single data.frame with one row per user and band: one column identifies the user, another identifies the band, and a third holds the rating. There are several tools to do this, and several questions here on SO. Look for the tag in particular.

There is also a package called reshape that can help here. It calls the process I'm describing “melting” the data.
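As a rough illustration of what the narrow format looks like, here is a minimal sketch that uses only base R rather than the reshape package. The toy data and the column names `user`, `band` and `rating` are assumptions made for the example, not the layout of the real pastebin data.

```r
# Hypothetical toy data in wide format: one column per band.
# check.names = FALSE keeps names such as "Green Day" unchanged.
user <- data.frame(
  user = c("u1", "u2"),
  Nirvana = c(10, 0),
  `Green Day` = c(5, 3),
  `Daft-Punk` = c(0, 8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except the id column is treated as a band.
band_cols <- setdiff(names(user), "user")

# Narrow format: one row per (user, band) pair, rating in a third column.
long <- data.frame(
  user = rep(user$user, times = length(band_cols)),
  band = rep(band_cols, each = nrow(user)),
  rating = unlist(user[band_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
```

With the band names stored as values in the `band` column, they can be used as a lookup key in later steps.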

I would like to query another dataframe(:=artists - with the artist's music tags) and substitute the name of the bands by its Genre-Tag (Nirvana --> Rock, Green Day --> Rock, Daft-Punk --> Techno). There are ~120 Tags for music taste (120 < 900)

You can use merge to combine multiple data frames, using the band name as merge key. This is the reason why you'd want the band names to be values, not column names.
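A minimal sketch of that merge, assuming the ratings are already in narrow format and that the artists data frame has one row per band with columns named `band` and `tag` (both names are assumptions for this example):

```r
# Hypothetical narrow-format ratings for a single user.
long <- data.frame(
  user = c("u1", "u1", "u1"),
  band = c("Nirvana", "Green Day", "Daft-Punk"),
  rating = c(10, 5, 0),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table mapping each band to one genre tag.
artists <- data.frame(
  band = c("Nirvana", "Green Day", "Daft-Punk"),
  tag = c("Rock", "Rock", "Techno"),
  stringsAsFactors = FALSE
)

# Join on the shared "band" column; each rating row gains its tag.
tagged <- merge(long, artists, by = "band")

print(tagged)
```

By default `merge` keeps only bands that appear in both data frames. Passing `all.x = TRUE` would keep ratings for bands that have no tag, with `NA` in the `tag` column.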

And finally, I would like to "aggregate" the values over all columns to avoid duplicated columns. In the example from (3) - with the aggregation function "SUM" - the row would have only 2 entries and not 3: (Rock = 15, Techno=0)

When you use reshape to “cast” your data back to wide format, you can supply an aggregate function which will be used to combine values. You can use sum for that.
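The same idea can be sketched in base R, swapping the reshape package's cast step for `xtabs`, which sums a numeric column over each combination of the grouping variables. The data frame `tagged` and its column names are assumptions for this example.

```r
# Hypothetical narrow-format data after the band names were replaced by tags.
tagged <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2", "u2"),
  tag = c("Rock", "Rock", "Techno", "Rock", "Rock", "Techno"),
  rating = c(10, 5, 0, 0, 3, 8),
  stringsAsFactors = FALSE
)

# Sum the ratings for every (user, tag) pair; rows are users, columns are tags.
wide <- xtabs(rating ~ user + tag, data = tagged)

# Convert the resulting table into an ordinary wide data frame.
wide_df <- as.data.frame.matrix(wide)

print(wide_df)
```

For user `u1` this gives Rock = 15 and Techno = 0, matching the example in the question. Note that `xtabs` always sums; for a different aggregate, `tapply(tagged$rating, list(tagged$user, tagged$tag), mean)` is one base R alternative.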

Upvotes: 2
