Reputation: 37
I am trying to make a ggplot of the following data which has information on when (date and time) a person (denoted by id) synced their data to the server. I have removed the date variable for simplicity.
district id year_sync time_sync
A 1 2020 12:03:19
A 2 2020 14:33:23
A 3 2020 13:14:30
A 4 2020 12:37:07
A 5 2020 12:45:48
A 6 2020 02:26:57
A 7 2020 08:10:03
A 8 2020 12:08:15
A 9 2020 15:21:52
A 10 2020 17:42:33
A 11 2020 14:23:29
A 12 2020 23:18:19
A 13 2020 12:39:14
A 14 2020 11:31:33
A 15 2020 13:00:14
A 16
A 17
A 18
A 19
A 20
A 21
B 22
B 23
B 24
B 25
B 26
B 27
B 28
B 29
B 30
B 31 2019 12:39:31
B 32 2019 11:44:39
B 33 2019 10:18:20
B 34 2019 18:11:48
B 35 2019 17:22:32
B 36 2019 12:17:23
B 37 2019 12:58:30
B 38 2019 18:50:29
B 39 2019 12:58:52
B 40 2019 21:12:36
B 41 2019 15:57:53
B 42 2019 12:52:44
B 43 2019 14:10:48
B 44 2019 15:40:08
B 45 2019 14:34:07
B 46 2019 02:40:28
B 47 2019 01:37:05
B 48 2019 14:36:01
B 49 2019 11:19:45
B 50 2019 15:33:42
B 51 2019 21:00:49
A 52 2020 15:02:01
A 53 2020 20:28:23
A 54 2020 17:02:37
A 55 2020 15:01:24
A 56 2020 11:29:02
A 57 2020 18:31:05
A 58 2020 12:07:51
A 59 2020 13:00:11
A 60 2020 09:35:08
A 61 2020 18:25:53
B 62 2020 18:12:51
B 63 2020 14:26:31
B 64 2020 14:46:51
B 65 2020 18:04:50
B 66 2020 07:08:21
B 67 2020 14:37:16
B 68 2020 11:56:24
B 69 2020 13:19:34
B 70 2019 15:34:24
B 71 2019 15:02:03
B 72 2019 11:05:08
B 73 2019 16:11:18
A 74 2019 23:51:36
A 75 2019 13:30:46
A 76 2019 12:28:43
A 77 2019 12:38:56
A 78 2019 11:22:05
A 79 2019 15:03:20
A 80 2019 11:27:34
I want to plot a yearly comparison graph, that is, how many IDs synced data in 2020 vs. 2019. For that I used the following code:
df1 <- df %>%
  group_by(year_sync) %>%
  dplyr::summarize(non_na_count = sum(!is.na(year_sync))) %>% ## I only want to calculate % based on non-missing values
  setNames(., c('year', 'count')) %>%
  mutate('share' = count/sum(count), label = paste0(round(share*100, 2), '%'))
ggplot(df1, aes(y=count, x=year)) +
  geom_bar(stat='identity',
           #color = "black"
           #fill = c("aquamarine4", "bisque3"),
           position = "dodge") +
  geom_text(aes(label = label),
            position = position_stack(vjust = 1.05),
            size = 3) +
  xlab ("Year") +
  ylab ("Number of People") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))
This doesn't work quite well, as my x-axis shows values like 2018.0, 2018.5, etc. (below). I want the x-axis to show only 2019 and 2020.
Note: the graph is based on the original dataset, so don't worry about matching the percentages.
I would like help on the following:
1.1 Fix my x-axis (ADDRESSED)
1.2 Do a facet grid for districts wherein proportions (for labels) are calculated as per total observations within each district. (PENDING)
1.3 Fix Fill - I want the bars in different colors. However, somehow the fill is not working currently. (ADDRESSED)
EDIT For point 1.2: I am trying out the following code:
df2 <-
  df %>% dplyr::filter(!is.na(year_sync)) ## filtering NAs
df3 <- df2 %>%
  group_by(district) %>%
  dplyr::mutate(ssum = n()) %>%
  dplyr::count(year_sync, ssum) %>%
  mutate(percent = n / ssum,
         label = paste0(round(percent*100, 2), '%')) ## to calculate % based on total number of IDs in each district
ggplot(df3, aes(y=ssum, x=factor(year), fill=district)) +
  geom_bar(stat='identity',
           #color='black',
           position = position_dodge(width=0.8), width=0.8) +
  geom_text(aes(label = label, y=count+10),
            position = position_dodge(width=0.8),
            size = 3) +
  xlab ("Year") +
  ylab ("Number of People") +
  scale_fill_manual(values=c("aquamarine4", "bisque3")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))
However, I am getting the following error: Error in unique.default(x, nmax = nmax) : unique() applies only to vectors . Can anyone tell me what's wrong?
Thank you!
Upvotes: 1
Views: 1125
Reputation: 13793
This is a two-in-one question, so here's a two-in-one solution.
First, here is how you can fix the three points on your plot:
Fix the x-axis. Since df1$year is classified as an int, the x axis is treated as a numeric/continuous axis, which is why "2019.5" makes sense for ggplot. One way around that is to simply tell ggplot it needs to treat df1$year as a discrete axis, which can be done by forcing the year to be a factor. You can do that prior to the ggplot() call, or inline by indicating x=factor(year) instead of x=year within aes().
Facet grid for districts. You can use facet_grid() for that, but you'll also need to group your dataset by district. That means adjusting some of the code you used to process df into df1 (add the extra column name in setNames() and add district to your group_by() call). You can then add a call to facet_grid(), passing . ~ district to facet district into columns, or district ~ . to facet district into rows.
Fix the fill color. ggplot works on the principle that the use of different colors should convey some new information in your plot. Consequently, if you want the fill to change between columns, it should be linked to something in your dataset. Here, I'll assume you want each district to be colored differently. For ggplot to process that, you need to put fill= into the aesthetics (aes()) and link it to the district column of your dataset. You can then either accept the default colors or specify them using scale_fill_manual(values=...).
Putting this all together, here's the new code to go from your original dataset to a new plot:
library(dplyr)
library(ggplot2)
df1 <- df %>%
  group_by(district, year_sync) %>%
  dplyr::summarize(non_na_count = sum(!is.na(year_sync))) %>% ## count only non-missing values
  setNames(., c('district', 'year', 'count')) %>%
  ## after summarize() the data is still grouped by district, so share is per district
  mutate('share' = count/sum(count), label = paste0(round(share*100, 2), '%'))
ggplot(df1, aes(y=count, x=factor(year), fill=district)) +
  geom_bar(stat='identity', color='black') +
  # note I've pushed the labels up slightly using count+1.
  # also note you don't want to use position="stack" here for the text.
  geom_text(aes(label = label, y=count+1), size = 3) +
  xlab ("Year") +
  ylab ("Number of People") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic")) +
  scale_fill_manual(values=c("aquamarine4", "bisque3")) +
  facet_grid(. ~ district)
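If you want to check the per-district percentages without plotting, here is a minimal sketch on made-up data. The data frame df_toy and its values are hypothetical and only stand in for your df; it also uses count() plus an explicit group_by(district) rather than the summarize()/setNames() route above, so treat it as an alternative way to get the same columns, not the only one.

```r
library(dplyr)

# Hypothetical toy data standing in for the real df (values are made up)
df_toy <- data.frame(
  district  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  id        = 1:8,
  year_sync = c(2019, 2020, 2020, NA, 2019, 2019, 2020, NA)
)

shares <- df_toy %>%
  filter(!is.na(year_sync)) %>%                          # drop ids that never synced
  count(district, year = year_sync, name = "count") %>%  # one row per district/year
  group_by(district) %>%                                 # shares are computed within each district
  mutate(share = count / sum(count),
         label = paste0(round(share * 100, 2), "%")) %>%
  ungroup()

print(shares)
```

Because the grouping is by district when share is computed, the shares within each district add up to 1, which is what the facet labels need.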
While not your question, I would also recommend that rather than faceting you use "dodging" to showcase the two districts. Depending on the point of the plot, dodged columns are a better way of comparing the districts with each other for any given x value (year). The code changes a bit for that to work for the plot portion. The most important thing to note is that you need to use position=position_dodge() and specify dodging for both geom_bar() and geom_text(). Both will use the fill= aesthetic here as the column in your dataset by which to "dodge":
ggplot(df1, aes(y=count, x=factor(year), fill=district)) +
  geom_bar(stat='identity', color='black',
           position = position_dodge(width=0.8), width=0.8) +
  geom_text(aes(label = label, y=count+1),
            position = position_dodge(width=0.8), size = 3) +
  xlab ("Year") +
  ylab ("Number of People") +
  scale_fill_manual(values=c("aquamarine4", "bisque3")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))
For the second part (a histogram of the sync times), you have to ensure your df$time_sync column is in a recognizable "date" or "datetime" format. @yingw had it close, but not quite, as that column needs to be converted with as.POSIXct() in order to work. Following that, you can draw the histogram by simply using geom_histogram() and setting your x= aesthetic to the transformed df$time_sync column. The problem you will run into is that the axis by default now includes both date and time, even though your data only had time. To strip away the date portion and only show the time, I'm using the scales library to control the formatting via scale_x_datetime() and date_format(), with date_breaks= to set the breaks and labels for that scale.
library(scales)
df %>% dplyr::filter(!is.na(time_sync)) %>%
  ggplot(aes(as.POSIXct(time_sync, format = "%H:%M:%S"))) +
  geom_histogram(color='black', fill='bisque3') +
  scale_x_datetime(labels=date_format("%H:%M:%S"), date_breaks="3 hours") +
  xlab('Time of Day')
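Coming back to the error in the question's EDIT (point 1.2): df3 as built there has the columns district, year_sync, ssum, n, percent and label, but the plot call refers to year and count. Neither exists in df3, so R looks for them outside the data. count then resolves to the dplyr function, and if a package that exports a year() function (lubridate, for example) is attached, year resolves to a function too. That last part is an assumption, since the question does not show which packages are loaded, but calling factor() on a function is consistent with the "unique() applies only to vectors" message. A sketch of the corrected call, using a made-up df_toy in place of df and assuming you want the per-year count n as the bar height:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the real df (values are made up)
df_toy <- data.frame(
  district  = rep(c("A", "B"), each = 5),
  year_sync = c(2019, 2020, 2020, 2020, NA, 2019, 2019, 2019, 2020, NA)
)

df3 <- df_toy %>%
  filter(!is.na(year_sync)) %>%
  group_by(district) %>%
  mutate(ssum = n()) %>%                 # total non-missing ids per district
  count(year_sync, ssum) %>%             # adds column n (ids per district and year)
  mutate(percent = n / ssum,
         label = paste0(round(percent * 100, 2), "%"))

# Use the column names that actually exist in df3: year_sync and n
p <- ggplot(df3, aes(x = factor(year_sync), y = n, fill = district)) +
  geom_bar(stat = "identity",
           position = position_dodge(width = 0.8), width = 0.8) +
  geom_text(aes(label = label, y = n + 0.2),
            position = position_dodge(width = 0.8), size = 3) +
  xlab("Year") +
  ylab("Number of People") +
  scale_fill_manual(values = c("aquamarine4", "bisque3")) +
  theme_minimal()

print(p)
```

The label offset (n + 0.2) is sized for the toy data; with the real counts something like n + 10, as in the question, may be more appropriate.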
Upvotes: 1
Reputation: 307
For your first question, replace x=year with x=factor(year), and add fill=district inside aes().
For your second question, you'll probably want to use geom_histogram() and the strptime function, something like
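df %>%
  filter(!is.na(time_sync)) %>%
  # strptime() returns POSIXlt; wrapping it in as.POSIXct() gives ggplot a type it can scale
  ggplot(aes(as.POSIXct(strptime(time_sync, format = "%H:%M:%S")))) +  # layers are added with +, not %>%
  geom_histogram()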
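To see what the two parsers return, here is a small base-R sketch. The times vector is just three values copied from the question's data, and tz = "UTC" is an assumption made so the result does not depend on the local time zone.

```r
# A few time-of-day strings taken from the question's time_sync column
times <- c("12:03:19", "02:26:57", "23:18:19")

lt <- strptime(times, format = "%H:%M:%S", tz = "UTC")    # list-based date-time
ct <- as.POSIXct(times, format = "%H:%M:%S", tz = "UTC")  # numeric date-time

class(lt)  # "POSIXlt" "POSIXt"
class(ct)  # "POSIXct" "POSIXt"

# Both attach the current date, since the strings carry none;
# formatting with "%H:%M:%S" recovers just the time of day.
format(ct, "%H:%M:%S")
```

This is also why a time-only label format is needed on the axis: the parsed values carry a date even though the input did not.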
Upvotes: 0