dnsko

Reputation: 1047

R Split data efficiently for boxplot

I am creating graphs to compare data sets and am currently working on a boxplot. I have IMDb data and the MovieLens 100k data (from http://grouplens.org/datasets/movielens/).

For IMDb it was rather easy to create the boxplot; the data frame looks like this: [image: IMDb data frame]

For MovieLens, however, the genre column looks like this: [image: MovieLens genre column]

How would I create a boxplot when a single movie has multiple genres? Ideally I would combine it with the IMDb boxplot that I already have, which looks like this:

[image: IMDb boxplot of Rating by Genre]

Currently, the code for the IMDb one is like this:

  all_movies$Rating <- as.numeric(as.character(all_movies$Rating))
  output$boxplot <- renderPlot({
    p <- ggplot(all_movies) + geom_boxplot(aes(x = Genre, y = Rating))
    p
  })

How would I create something similar for MovieLens?

Upvotes: 0

Views: 296

Answers (1)

Wave

Reputation: 1266

Gregor already suggested what I also think is the best solution:

# packages used below (melt comes from reshape2)

library(reshape2)
library(ggplot2)

# example df

lens=data.frame(movie=c('A','B'),genre=c('Adventure|Animation','Comedy|Animation'),rating=8:9)

# create new columns

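# split on a literal "|" and test whole genre names, so regex characters
# or genres that are substrings of other genres cannot cause false matches
genre_list=strsplit(as.character(lens$genre),"|",fixed=TRUE)
genres=unique(unlist(genre_list))
for(i in genres){
  lens[[i]]=vapply(genre_list,function(g) i %in% g,logical(1))
}
lens$genre=NULL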

# melt for ggplot

lens=melt(lens,id=c('movie','rating'))
lens=lens[lens$value==TRUE,]

ggplot(lens,aes(x=variable,y=rating)) + geom_boxplot()
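
As an alternative to the loop and melt above, the same long format (one row per movie and genre) can be built directly in base R. This is only a sketch on made-up toy data; the name lens_long is hypothetical, and the column is called variable just to match the plot call above.

```r
# Toy data in the same shape as the example above (values are made up).
lens <- data.frame(
  movie  = c("A", "B"),
  genre  = c("Adventure|Animation", "Comedy|Animation"),
  rating = 8:9,
  stringsAsFactors = FALSE
)

# Split each genre string on a literal "|" (fixed = TRUE disables regex).
parts <- strsplit(lens$genre, "|", fixed = TRUE)

# Repeat every movie row once per genre it belongs to.
lens_long <- data.frame(
  movie    = rep(lens$movie, lengths(parts)),
  rating   = rep(lens$rating, lengths(parts)),
  variable = unlist(parts),
  stringsAsFactors = FALSE
)
```

lens_long can then be passed to the same ggplot call. A movie with several genres contributes its rating to each of those genres' boxes, exactly as in the melt version.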

If you want both movie databases on the same plot, create the same structure for the IMDb data, add a column with the source name to both data frames (imdb$source="IMDb", lens$source="movielens") and rbind them (df=rbind(imdb,lens)). Both data frames need the same column names for rbind to work.

The plot would be:

ggplot(df,aes(x=variable,y=rating,col=source)) + geom_boxplot()
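
To make the combining step concrete, here is a minimal sketch with made-up ratings. The data frames imdb and lens and all values in them are hypothetical; the only requirement taken from the answer is that both share the same columns before rbind.

```r
# Hypothetical long-format data: one row per movie and genre.
# Both frames must have identical column names for rbind().
imdb <- data.frame(variable = c("Comedy", "Comedy", "Drama"),
                   rating   = c(7.1, 6.4, 8.0),
                   stringsAsFactors = FALSE)
lens <- data.frame(variable = c("Comedy", "Drama", "Drama"),
                   rating   = c(3.5, 4.0, 4.5),
                   stringsAsFactors = FALSE)

# Tag each row with its origin, then stack the two frames.
imdb$source <- "IMDb"
lens$source <- "movielens"
df <- rbind(imdb, lens)

# Build the grouped boxplot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = variable, y = rating, col = source)) + geom_boxplot()
}
```

Calling print(p) draws one box per genre and source, side by side. If the two sources rate on different scales, the boxes are not directly comparable until one of them is rescaled.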

Upvotes: 1
