Rachita

Reputation: 37

Plotting yearly comparison & time distribution in ggplot2 R

I am trying to make a ggplot of the following data which has information on when (date and time) a person (denoted by id) synced their data to the server. I have removed the date variable for simplicity.

district id year_sync time_sync
    A   1   2020    12:03:19
    A   2   2020    14:33:23
    A   3   2020    13:14:30
    A   4   2020    12:37:07
    A   5   2020    12:45:48
    A   6   2020    02:26:57
    A   7   2020    08:10:03
    A   8   2020    12:08:15
    A   9   2020    15:21:52
    A   10  2020    17:42:33
    A   11  2020    14:23:29
    A   12  2020    23:18:19
    A   13  2020    12:39:14
    A   14  2020    11:31:33
    A   15  2020    13:00:14
    A   16      
    A   17      
    A   18      
    A   19      
    A   20      
    A   21      
    B   22      
    B   23      
    B   24      
    B   25      
    B   26      
    B   27      
    B   28      
    B   29      
    B   30      
    B   31  2019    12:39:31
    B   32  2019    11:44:39
    B   33  2019    10:18:20
    B   34  2019    18:11:48
    B   35  2019    17:22:32
    B   36  2019    12:17:23
    B   37  2019    12:58:30
    B   38  2019    18:50:29
    B   39  2019    12:58:52
    B   40  2019    21:12:36
    B   41  2019    15:57:53
    B   42  2019    12:52:44
    B   43  2019    14:10:48
    B   44  2019    15:40:08
    B   45  2019    14:34:07
    B   46  2019    02:40:28
    B   47  2019    01:37:05
    B   48  2019    14:36:01
    B   49  2019    11:19:45
    B   50  2019    15:33:42
    B   51  2019    21:00:49
    A   52  2020    15:02:01
    A   53  2020    20:28:23
    A   54  2020    17:02:37
    A   55  2020    15:01:24
    A   56  2020    11:29:02
    A   57  2020    18:31:05
    A   58  2020    12:07:51
    A   59  2020    13:00:11
    A   60  2020    09:35:08
    A   61  2020    18:25:53
    B   62  2020    18:12:51
    B   63  2020    14:26:31
    B   64  2020    14:46:51
    B   65  2020    18:04:50
    B   66  2020    07:08:21
    B   67  2020    14:37:16
    B   68  2020    11:56:24
    B   69  2020    13:19:34
    B   70  2019    15:34:24
    B   71  2019    15:02:03
    B   72  2019    11:05:08
    B   73  2019    16:11:18
    A   74  2019    23:51:36
    A   75  2019    13:30:46
    A   76  2019    12:28:43
    A   77  2019    12:38:56
    A   78  2019    11:22:05
    A   79  2019    15:03:20
    A   80  2019    11:27:34
  1. I want to plot a yearly comparison graph, that is, how many IDs synced data in 2020 vs. 2019. I used the following code:

    df1 <- df %>%
         group_by(year_sync) %>%
         dplyr::summarize(non_na_count = sum(!is.na(year_sync))) %>% ## I only want to calculate % based on non-missing values 
         setNames(., c('year', 'count')) %>%
         mutate('share' = count/sum(count), label = paste0(round(share*100, 2), '%'))
    
         ggplot(df1, aes(y=count, x=year)) +
           geom_bar(stat='identity',
                    #color = "black"
                    #fill = c("aquamarine4", "bisque3"),
                    position = "dodge") +
           geom_text(aes(label = label),
                     position = position_stack(vjust = 1.05),
                     size = 3) +
           xlab ("Year")   +
           ylab ("Number of People")  +
           theme_minimal() +
           theme(plot.title = element_text(hjust = 0.5, face = "bold"),
                 plot.subtitle = element_text(hjust = 0.5, face = "italic"))
    

This doesn't work quite well as I get my x-axis as 2018.0 2018.5 etc (below). I want the x-axis to have only 2019 and 2020.

Note: the graph is based on the original dataset, so the percentages will not match the sample data above.

I would like help on the following:

1.1 Fix my x-axis (ADDRESSED)

1.2 Do a facet grid for districts wherein proportions (for labels) are calculated as per total observations within each district. (PENDING)

1.3 Fix Fill - I want the bars in different colors. However, somehow the fill is not working currently.(ADDRESSED)

  2. I would also like to plot a time distribution for time_sync to know about when people usually sync their data. However, I am unable to do so. (ADDRESSED)

EDIT For point 1.2: I am trying out the following code:

df2 <-
    df %>% dplyr::filter(!is.na(year_sync)) ## filtering NAs

df3 <- df2 %>%
      group_by(district) %>%
      dplyr::mutate(ssum = n()) %>%
      dplyr::count(year_sync, ssum)  %>% 
      mutate(percent = n / ssum,
             label = paste0(round(percent*100, 2), '%')) ## to calculate % based on total number of IDs in each district

plotting

    ggplot(df3, aes(y=ssum, x=factor(year), fill=district)) +
      geom_bar(stat='identity',
               #color='black',
               position = position_dodge(width=0.8), width=0.8) +
      geom_text(aes(label = label, y=count+10),
                position = position_dodge(width=0.8),
                size = 3) +
      xlab ("Year")   +
      ylab ("Number of People")  +
      scale_fill_manual(values=c("aquamarine4", "bisque3")) +
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold"),
            plot.subtitle = element_text(hjust = 0.5, face = "italic"))

However, I am getting the following error: Error in unique.default(x, nmax = nmax) : unique() applies only to vectors . Can anyone tell me what's wrong?

Thank you!

Upvotes: 1

Views: 1125

Answers (2)

chemdork123

Reputation: 13793

This is a two-in-one question, so here's a two-in-one solution:

Fix the bar plot

Here is how to fix the three points on your plot:

  1. Fix the x-axis. Since df1$year is stored as an integer, ggplot treats the x-axis as numeric/continuous, which is why a break like "2019.5" makes sense to ggplot. One way around that is to tell ggplot to treat df1$year as discrete by converting the year to a factor. You can do that before the ggplot() call, or inline by using x=factor(year) instead of x=year within aes().

  2. Facet grid for districts. You can use facet_grid() for that, but you'll also need to group your dataset by district. That means adjusting the code you used to turn df into df1: add district to your group_by() call and add the extra column name in setNames(). You can then add a call to facet_grid(), passing . ~ district to facet the districts into columns, or district ~ . to facet them into rows.

  3. Fix the fill color. ggplot works on the principle that a difference in color should convey some new information in your plot. Consequently, if you want the fill to change between columns, it should be linked to something in your dataset. Here, I'll assume you want each district to be colored differently. For ggplot to process that, you need to put fill= inside the aesthetics (aes()) and link it to the district column of your dataset. You can then either accept the default colors or specify them using scale_fill_manual(values=...).
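As a quick check of point 1, here is a minimal sketch that compares the x scale ggplot2 builds for an integer year with the one it builds for factor(year). The data frame toy and its counts are made up for illustration and are not part of the question; geom_col() is used as a shorthand for geom_bar(stat='identity').

```r
library(ggplot2)

# Hypothetical toy summary; the counts are invented for illustration
toy <- data.frame(year = c(2019L, 2020L), count = c(40, 33))

p_numeric <- ggplot(toy, aes(x = year, y = count)) + geom_col()
p_factor  <- ggplot(toy, aes(x = factor(year), y = count)) + geom_col()

# layer_scales() returns the scales ggplot2 set up for a plot
class(layer_scales(p_numeric)$x)[1]  # a continuous position scale
class(layer_scales(p_factor)$x)[1]   # a discrete position scale
```

With the discrete scale, the axis only has positions for the years that actually occur in the data, so fractional breaks cannot appear.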

Putting this all together, here's the new code to go from your original dataset to a new plot:

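library(dplyr)
library(ggplot2)

df1 <- df %>%
  group_by(district, year_sync) %>%
  # count only non-missing sync years; rows with a missing year_sync
  # form their own NA group with a count of 0
  dplyr::summarize(non_na_count = sum(!is.na(year_sync))) %>%
  setNames(., c('district', 'year', 'count')) %>%
  # summarize() drops the last grouping level, so share is computed within each district
  mutate('share' = count/sum(count), label = paste0(round(share*100, 2), '%'))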


ggplot(df1, aes(y=count, x=factor(year), fill=district)) +
  geom_bar(stat='identity', color='black') +
  # note I've pushed the labels up slightly using count+1.
  # also note you don't want to use position="stack" here for the text.
  geom_text(aes(label = label, y=count+1), size = 3) +
  xlab ("Year")   +
  ylab ("Number of People")  +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic")) +
  scale_fill_manual(values=c("aquamarine4", "bisque3")) +
  facet_grid(. ~ district)

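To make the per-district percentages easier to verify, here is a small self-contained sketch of the same idea. It is a variation rather than the code above: it assumes you want to drop the IDs with a missing year_sync before counting (so no NA group is created), and it uses a made-up data frame, toy, instead of your dataset.

```r
library(dplyr)

# Hypothetical toy data, not the asker's dataset
toy <- data.frame(
  district  = c("A", "A", "A", "B", "B", "B", "B", "B"),
  year_sync = c(2019, 2020, 2020, 2019, 2019, 2019, 2020, NA)
)

shares <- toy %>%
  filter(!is.na(year_sync)) %>%                  # assumption: ignore IDs that never synced
  count(district, year_sync, name = "count") %>% # one row per district/year
  group_by(district) %>%                         # shares are relative to the district total
  mutate(share = count / sum(count),
         label = paste0(round(share * 100, 2), "%")) %>%
  ungroup()

shares
```

Because the shares are computed after group_by(district), they add up to 100% within each district, which is what the labels in each facet should show.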

[Bonus] A different bar plot?

While not your question, I would also recommend that rather than faceting you use "dodging" to showcase the two districts. Depending on the point of the plot, dodged columns are a better way for comparing the districts with each other for any given x value (year). The code changes a bit for that to work for the plot portion. The most important thing to note is that you need to use position=position_dodge() and specify dodging for both geom_bar() and geom_text(). Both will use the fill= aesthetic here as the column in your dataset by which to "dodge":

ggplot(df1, aes(y=count, x=factor(year), fill=district)) +
  geom_bar(stat='identity', color='black',
           position = position_dodge(width=0.8), width=0.8) +
  geom_text(aes(label = label, y=count+1),
            position = position_dodge(width=0.8), size = 3) +
  xlab ("Year")   +
  ylab ("Number of People")  +
  scale_fill_manual(values=c("aquamarine4", "bisque3")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))


Plotting a Histogram for the time distribution

For this, you have to ensure your df$time_sync column is converted to a recognizable date-time class. @yingw had it close, but not quite, as that column needs to be converted with as.POSIXct() in order to work. Following that, you can draw the histogram by simply using geom_histogram() and setting your x= aesthetic to the transformed df$time_sync column. The problem you will run into is that the axis now shows both date and time by default, even though your data only had time. To strip away the date portion and only show the time, I'm using the scales library to control the formatting via scale_x_datetime() and date_format(), together with the date_breaks argument to set the breaks for that scale.

library(scales)

df %>% dplyr::filter(!is.na(time_sync)) %>%
  ggplot(aes(as.POSIXct(time_sync, format = "%H:%M:%S"))) +
  geom_histogram(color='black', fill='bisque3') +
  scale_x_datetime(labels=date_format("%H:%M:%S"), date_breaks="3 hours") +
  xlab('Time of Day')

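One thing to watch for: as.POSIXct() parses the times in your session's time zone, while date_format() formats labels in UTC by default, so the axis labels can be shifted if your machine is not set to UTC. The sketch below is one way to keep both sides consistent, assuming you are happy to treat the times as UTC. The data frame toy holds a few times copied from the question, and the one-hour binwidth is my own choice for illustration.

```r
library(ggplot2)

# Hypothetical sample: a few sync times copied from the question, plus one missing value
toy <- data.frame(time_sync = c("12:03:19", "14:33:23", "02:26:57", "23:18:19", NA))
toy <- toy[!is.na(toy$time_sync), , drop = FALSE]

# Parse with an explicit time zone; as.POSIXct() attaches today's date to each time
toy$time_parsed <- as.POSIXct(toy$time_sync, format = "%H:%M:%S", tz = "UTC")

p <- ggplot(toy, aes(x = time_parsed)) +
  # POSIXct values are seconds, so binwidth = 3600 means one-hour bins
  geom_histogram(binwidth = 3600, color = "black", fill = "bisque3") +
  # format the labels in the same time zone that was used for parsing
  scale_x_datetime(date_labels = "%H:%M", date_breaks = "3 hours", timezone = "UTC") +
  xlab("Time of Day")

p
```

Here date_labels and timezone are arguments of scale_x_datetime() itself, so the scales package does not need to be attached for this variant.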

Upvotes: 1

yingw

Reputation: 307

For your first question:

  1. Replace x=year with x=factor(year).
  2. Add + facet_grid(factor(district)~.).
  3. You'll need a new column that holds the color, or use fill=district.

For your second question, you'll probably want to use geom_histogram() and the strptime() function, something like:

df %>%
    filter(!is.na(time_sync)) %>%
    # ggplot layers are added with +, not %>%
    ggplot(aes(strptime(time_sync, format = "%H:%M:%S"))) +
    geom_histogram()

Upvotes: 0
