theBigCheese88

Reputation: 452

Plot several categorical data on a line graph or scatter plot of date by count

I have some data similar to this:

            year     car_type
    1       1993     sport
    2       1994     sport
    3       1945     family
    4       1955     off-road
    5       1998     sport
    6       1966     off-road
    7       2001     super
    8       1999     super
    9       2010     super
    10      1988     off-road
    11      1988     off-road
    12      1988     sport
    13      2014     sport
    14      2056     super
    15      2022     family
    16      2022     family
    17      2008     family
    18      2001     off-road
    19      2018     super
    20      2008     family
    21      2020     sport
    22      2013     sport
    23      2014     super
    24      2015     off-road
    25      2014     off-road
    26      2013     sport
    27      2013     super
    28      2014     super
    29      2020     off-road
    30      2020     sport

Note: both year and car_type can occur more than once.

I want to plot a line graph or scatter plot with the year on the x axis and, on the y axis, the number of cars (of any car_type) that occur in that year.

I can work out how to plot multiple lines from https://r-graphics.org/recipe-line-graph-multiple-line, but I don't know how to plot a line graph of one variable against its number of occurrences, i.e. with the date on the x axis and the number of times that date occurs on the y axis. The same goes for a scatter plot.

I can already show the same idea in a stacked bar chart.

However, that doesn't show the occurrence of these cars over time. Any help would be appreciated.

Upvotes: 0

Views: 1353

Answers (3)

TarJae

Reputation: 79246

Maybe you are interested in this kind of solution?

library(tidyverse)
library(lubridate) # for working with dates
library(scales)    # to access breaks/formatting functions

df %>%
  count(year, name = "N") %>%  # number of cars (any car_type) per year
  arrange(year) %>%
  mutate(year = lubridate::ymd(year, truncated = 2L)) %>%  # year -> Date (1 January)
  ggplot(aes(x = year, y = N)) +
  geom_line(color = "steelblue", size = 1) +  # use linewidth = 1 in ggplot2 >= 3.4.0
  geom_point() +
  scale_x_date(breaks = date_breaks("5 year"), date_labels = "%Y") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  xlab("year") +
  ylab("Cars(N)") +
  ylim(0, 6) +
  ggtitle("Cars per year")
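
The conversion of the numeric year to a Date is what lets scale_x_date() place the breaks. As a minimal sketch of that one step in base R (the vector name years is only illustrative, and 1 January is an arbitrary choice of day within each year):

```r
# Illustrative years taken from the question's data
years <- c(1993, 1994, 1945)

# Turn each year into a Date on 1 January of that year
dates <- as.Date(paste0(years, "-01-01"))

# Formatting with "%Y" recovers the year label used on the axis
labels <- format(dates, "%Y")
```

This is only an alternative to the lubridate call above for cases where loading lubridate is not wanted; the plotting code stays the same.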
   

data:

df <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
       11, 12, 13, 14, 15, 16, 17, 18, 19, 
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30),
year = c(1993, 1994, 1945, 1955, 1998, 1966, 2001, 1999,
         2010, 1988, 1988, 1988, 2014, 2056, 2022, 2022, 2008, 2001, 2018, 
         2008, 2020, 2013, 2014, 2015, 2014, 2013, 2013, 2014, 2020, 2020), 
car_type = c("sport", "sport", "family", "off-road", "sport", 
             "off-road", "super", "super", "super", "off-road", "off-road", 
             "sport", "sport", "super", "family", "family", "family", "off-road", 
             "super", "family", "sport", "sport", "super", "off-road", "off-road",
             "sport", "super", "super", "off-road", "sport"))

Upvotes: 1

Peter

Reputation: 12739

Here is a scatter plot version, based on your question and using the data from the question.

library(ggplot2)
library(dplyr)

The problem with a simple scatter plot is that, because the car axis is discrete, points for cars of the same type in the same year are plotted on top of each other, as in this first example.

ggplot(df)+
  geom_point(aes(year, car)) 

To make the graph more meaningful you can summarise the data by count of cars for a given category and year as follows:


df1 <- 
  df %>%
  group_by(year, car) %>% 
  summarise(count = n(), .groups = "drop")  # one row per year/car pair, ungrouped
 
ggplot(df1)+
  geom_point(aes(year, car, size = count))+
  scale_size_continuous(breaks = unique(df1$count))
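
As a possible shortcut, ggplot2's geom_count() does the counting of overlapping points itself and maps the result to point size, so the separate summarise step is not strictly needed. A minimal sketch, assuming a small made-up subset of the data with the same year and car columns (the object name cars is hypothetical):

```r
library(ggplot2)

# Hypothetical subset of the data, with the same column names as above
cars <- data.frame(
  year = c(1988L, 1988L, 1988L, 2014L, 2014L),
  car  = c("off-road", "off-road", "sport", "super", "super")
)

# geom_count() counts the observations at each (year, car) position
# and maps that count to the size of the point
p <- ggplot(cars, aes(year, car)) +
  geom_count()
```

The explicit summarise version above gives more control over the size legend breaks, so which one to use is a matter of taste.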

data

df <- structure(list(
  id = 2:30,
  year = c(1994L, 1945L, 1955L, 1998L, 1966L, 2001L, 1999L, 2010L,
           1988L, 1988L, 1988L, 2014L, 2056L, 2022L, 2022L, 2008L,
           2001L, 2018L, 2008L, 2020L, 2013L, 2014L, 2015L, 2014L,
           2013L, 2013L, 2014L, 2020L, 2020L),
  car = c("sport", "family", "off-road", "sport", "off-road", "super",
          "super", "super", "off-road", "off-road", "sport", "sport",
          "super", "family", "family", "family", "off-road", "super",
          "family", "sport", "sport", "super", "off-road", "off-road",
          "sport", "super", "super", "off-road", "sport")),
  class = "data.frame", row.names = c(NA, -29L))

Created on 2021-04-10 by the reprex package (v2.0.0)

Upvotes: 1

teunbrand

Reputation: 38063

In ggplot2, every layer has two important components: a geom and a stat. Some geoms, such as geom_bar(), come with a non-identity stat attached by default, in this case stat_count(). If you want to replicate geom_bar() behaviour with geom_line(), you need to supply that stat to the layer explicitly.

library(ggplot2)

# Assuming 'data' is a data.frame with the data you've posted
ggplot(data, aes(year, colour = car_type)) +
  geom_line(stat = "count")
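
For the total number of cars per year regardless of car_type, the same idea should work without the colour aesthetic. The sketch below uses a small made-up subset of the data (the object names cars and counts are hypothetical) and also shows, in base R, the per-year counts that stat = "count" is expected to compute:

```r
library(ggplot2)

# Hypothetical subset of the question's data
cars <- data.frame(
  year = c(2013, 2013, 2014, 2014, 2014, 2020),
  car_type = c("sport", "super", "sport", "super", "off-road", "sport")
)

# Base R equivalent of the counting step: occurrences of each year
counts <- as.data.frame(table(year = cars$year), responseName = "count")
counts$year <- as.numeric(as.character(counts$year))

# One line for all car types combined, with a point at each year
p <- ggplot(cars, aes(year)) +
  geom_line(stat = "count") +
  geom_point(stat = "count")
```

Adding colour = car_type back into aes() splits this single line into one line per car type, as in the code above.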

Upvotes: 1
