stats_noob

Reputation: 5925

R: Automatically Producing Histograms

I am using the R programming language. I created the following data set for this example:

var_1 <- rnorm(1000,10,10)
var_2 <- rnorm(1000, 5, 5)
var_3 <- rnorm(1000, 6,18)

favorite_food <- c("pizza","ice cream", "sushi", "carrots", "onions", "broccoli", "spinach", "artichoke", "lima beans", "asparagus", "eggplant", "lettuce", "cucumbers")
favorite_food <-  sample(favorite_food, 1000, replace=TRUE, prob=c(0.5, 0.45, 0.04, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001))


response <- c("a","b")
response <- sample(response, 1000, replace=TRUE, prob=c(0.3, 0.7))


data = data.frame( var_1, var_2, var_3, favorite_food, response)

data$favorite_food = as.factor(data$favorite_food)
data$response = as.factor(data$response)

From here, I want to make histograms for the two categorical variables in this data set and put them on the same page:

# make histograms and put them on the same page (note: I don't know why the "par(mfrow = c(1,2))" statement is not working)
library(lattice)  # assumption: histogram() here is lattice::histogram
par(mfrow = c(1,2))

histogram(data$response, main = "response")

histogram(data$favorite_food, main = "favorite food")


My question: Is it possible to automatically produce histograms for all categorical variables in a given data set (without manually writing the "histogram()" statement for each variable) and print them on the same page? Would it be better to use the "ggplot2" library for this problem?

I can manually write the "histogram()" statement for each categorical variable in the data set, but I am looking for a quicker way to do this. Is it possible to do this with a "for loop"?

Thanks

Upvotes: 0

Views: 515

Answers (4)

Sinh Nguyen

Reputation: 4497

Here is an attempt using cowplot and ggplot2:

library(ggplot2)
library(dplyr)
library(foreach)
library(cowplot)

list_variables <- c("response", "favorite_food")
all_plot <- foreach(current_var = c(list_variables)) %do% {
  # Give each summary its own name so the plots do not all end up referencing the same summary data.
  data_summary_name <- paste0(current_var, "_summary")
  eval(substitute(
    {
      graph_data <- data %>%
        group_by(!!sym(current_var)) %>%
        summarize(count = n(), .groups = "drop") %>%
        mutate(share = count / sum(count))
      plot <- ggplot(graph_data) +
        geom_bar(mapping = aes(x = !!sym(current_var), y = share), width = 1,
          fill = "#00FFFF", color = "#000000", stat = "identity") +
        scale_y_continuous(labels = scales::percent) +
        ggtitle(current_var) + ylab("Percent of Total") +
        theme_bw()
    }, list(graph_data = as.name(data_summary_name))
  )) 
  return(plot)
}

plot_grid(plotlist = all_plot, ncol = 2)

Note: For why I use eval & substitute, see the question "ggplot2 generate same plot for different variables in a for loop".
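As a possible simplification, here is a minimal sketch that wraps the plotting step in a function, so each plot is built from its own summary table and no eval/substitute is needed. The helper name plot_share and the small demo_data frame are hypothetical stand-ins, not part of the original answer; the shares are computed with base R's table() and prop.table() rather than dplyr.

```r
library(ggplot2)
library(cowplot)

# Stand-in for the question's data frame (assumed; any data frame with factor columns works)
set.seed(1)
demo_data <- data.frame(
  response = factor(sample(c("a", "b"), 100, replace = TRUE)),
  favorite_food = factor(sample(c("pizza", "ice cream", "sushi"), 100, replace = TRUE))
)

# Hypothetical helper: bar chart of the share of each level for one column
plot_share <- function(df, var) {
  # Each call builds its own summary table, so plots cannot share data by accident
  shares <- as.data.frame(prop.table(table(df[[var]])))
  names(shares) <- c("level", "share")
  ggplot(shares, aes(x = level, y = share)) +
    geom_col(fill = "#00FFFF", color = "#000000", width = 1) +
    scale_y_continuous(labels = scales::percent) +
    labs(title = var, x = var, y = "Percent of Total") +
    theme_bw()
}

# Pick up every factor column automatically and build one plot per column
factor_cols <- names(demo_data)[sapply(demo_data, is.factor)]
all_plots <- lapply(factor_cols, plot_share, df = demo_data)

plot_grid(plotlist = all_plots, ncol = 2)
```

With the question's data, replacing demo_data with data should give one panel per factor column.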

Alternatively, using facet_wrap in an approach similar to QuishSwash's, but plotting shares instead of counts:

list_variables <- c("response", "favorite_food")
# Calculate the share of each level for the variables listed in list_variables.
# You can adjust this by selecting variables based on some condition.
summary_df <- bind_rows(foreach(current_var = c(list_variables)) %do% {
  data %>%
    group_by(variable = !!sym(current_var)) %>%
    summarize(count = n(), .groups = "drop") %>%
    mutate(share = count / sum(count),
      variable_name = current_var)
})

ggplot(summary_df) +
  geom_bar(
    aes(x = variable, y = share),
    fill = "#00FFFF", color = "#000000", stat = "identity") +
  facet_wrap(~variable_name, scales = "free") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()

Created on 2021-04-29 by the reprex package (v2.0.0)

Upvotes: 1

Ronak Shah

Reputation: 389135

Here's a base R alternative using barplot in a for loop:

cols <- names(data)[sapply(data, is.factor)]


# This would need some manual adjustment if the number of columns increases
par(mfrow = c(1,length(cols))) 

for(i in cols) {
  barplot(table(data[[i]]), main = i)
}
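If the number of factor columns grows, one way to avoid adjusting the layout by hand is to let grDevices::n2mfrow() suggest the grid dimensions. This is only a sketch under that assumption; demo_data is a hypothetical stand-in for the question's data frame.

```r
# Stand-in data with three factor columns (assumed, for illustration only)
set.seed(1)
demo_data <- data.frame(
  x  = rnorm(50),
  f1 = factor(sample(c("a", "b"), 50, replace = TRUE)),
  f2 = factor(sample(c("u", "v", "w"), 50, replace = TRUE)),
  f3 = factor(sample(c("yes", "no"), 50, replace = TRUE))
)

factor_cols <- names(demo_data)[sapply(demo_data, is.factor)]

# n2mfrow() returns a (rows, columns) layout suitable for the given number of plots
layout_dims <- grDevices::n2mfrow(length(factor_cols))

# par() returns the previous settings, so they can be restored afterwards
old_par <- par(mfrow = layout_dims)
for (col in factor_cols) {
  barplot(table(demo_data[[col]]), main = col)
}
par(old_par)
```

Note that par(mfrow = ...) only affects base graphics such as barplot(); lattice and ggplot2 are built on grid and ignore it.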


Upvotes: 3

Scransom

Reputation: 3335

A ggplot2/tidyverse solution is to pivot the columns into long format and then use faceting to plot them all on the same page:

(with edit to plot only factor variables)

factor_vars <- sapply(data, is.factor)

varnames <- names(data)

deselect_not_factors <- varnames[!factor_vars]

library(tidyr)
library(ggplot2)

data_long <- data %>%
  pivot_longer(
    cols = -all_of(deselect_not_factors),  # all_of() avoids the warning about external vectors in selections
    names_to = "category",
    values_to = "value"
  )

ggplot(data_long) +
  geom_bar(
    aes(x = value)
  ) +
  facet_wrap(~category, scales = "free")


Upvotes: 4

Fadel Megahed

Reputation: 116

As an alternative, you can capitalize on the fantastic DataExplorer package.

Note that histograms are for continuous variables, so what you want for your categorical variables are bar plots. This can be done as follows:

# install DataExplorer only if it is not already available
if (!requireNamespace("DataExplorer", quietly = TRUE)) install.packages("DataExplorer")
library(DataExplorer)
DataExplorer::plot_histogram(data) # plots histograms for continuous variables
DataExplorer::plot_bar(data) # bar plots for categorical variables

Please refer to the package manual for more details.

Upvotes: 2
