hongpastry

Reputation: 181

subway-style graph for word frequency across three datasets in ggplot2

this question is a follow-up to https://stackoverflow.com/a/64991805?noredirect=1. i have a data frame, dictfreq, with one dictionary column, dict, and the frequency of each dictionary word in three datasets: dictfreq$freq_gov, dictfreq$freq_indiv, and dictfreq$freq_media.

dict: apple, pear, pineapple
freq_gov: 12, 13, 10
freq_indiv: 11, 20, 1
freq_media: 13, 21, 9

desired output looks like this: https://blog.revolutionanalytics.com/2015/12/r-is-the-fastest-growing-language-on-stackoverflow.html where y-axis has:

- rank going from 1-3
- list of the words from dict (apple, pear, pineapple)

and x-axis has:

- categories of freq_gov, freq_indiv, freq_media

basically, i want to visualize comparison of the frequency of each word in dict across gov, indiv, and media.

this is the code template i have been trying to revise so far:

p <- ggplot(mapping = aes(dictfreq, y = rank, group = tag, color = tag)) +
  geom_line(size = 1.7, alpha = 0.25, data = dictfreq) +
  geom_line(size = 2.5, data = dictfreq %>% filter(tag %in% names(colors)[colors != "gray"])) +
  geom_point(size = 4, alpha = 0.25, data = dictfreq) +
  geom_point(size = 4, data = dftags4 %>% filter(tag %in% names(colors)[colors != "gray"])) +
  geom_point(size = 1.75, color = "white", data = dictfreq) +
  geom_text(data = dftags5, aes(label = tag), hjust = -0, size = 4.5) +
  geom_text(data = dftags6, aes(label = tag), hjust = 1, size = 4.5) +
  scale_color_manual(values = colors) +
  ggtitle("The subway-style-rank-year-tag plot:\nPast and the Future") +
  xlab("Top Tags by Year in Stackoverflow") +
  scale_x_continuous(breaks = seq(min(dftags4$creationyear) - 2,
                                 max(dftags4$creationyear) + 2),
                     limits = c(min(dftags4$creationyear) - 1.0,
                                max(dftags4$creationyear) + 0.5))
p

but i am having trouble molding it to my data. specifically, my x-axis will have three categories (media, gov, indiv), and these are separate columns in my data rather than values of a single variable. what should i do?

--

edit: including the actual data here - dput() as suggested:

structure(list(word = c("apple", "apple", "apple", 
"mandarin", "mandarin", "mandarin", "orange", "orange", "orange", "pear"), 
    name = c("freq_ongov", "freq_onindiv", "freq_onmedia", "freq_ongov", 
    "freq_onindiv", "freq_onmedia", "freq_ongov", "freq_onindiv", 
    "freq_onmedia", "freq_ongov"), value = c(0, 87, 63, 0, 44, 
    20, 3, 27, 25, 0), rank = c(26, 85, 70, 26, 61, 42.5, 86, 
    47, 48, 26)), row.names = c(NA, -10L), groups = structure(list(
    name = c("freq_ongov", "freq_onindiv", "freq_onmedia"), .rows = structure(list(
        c(1L, 4L, 7L, 10L), c(2L, 5L, 8L), c(3L, 6L, 9L)), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

should be noted that the actual data has 160 unique dict words!

--

update: i followed Allan's suggestion, and the pivot_longer() function worked, but i am stuck on an error when i try to generate the actual ggplot. this is my code:

ggplot(mergedicts, aes(name, rank, color = word, group = word)) +
  geom_line(size = 200) +
  geom_point(shape = 21, fill = "white", size =200) +
  scale_y_continuous(breaks = seq(max(mergedicts$rank)), labels = leftlabels,
                     sec.axis = sec_axis(~., breaks = seq(max(mergedicts$rank)), 
                                         labels = rightlabels)) +
  scale_x_discrete(expand = c(0.01, 0)) +
  guides(color = guide_none()) +
  coord_cartesian(clip = "off") +
  theme(axis.ticks.length.y = unit(0, "points"))

which gives the error:

Error: `breaks` and `labels` must have the same length
Run `rlang::last_error()` to see where the error occurred.
6. stop(fallback)
5. signal_abort(cnd)
4. abort("`breaks` and `labels` must have the same length")
3. check_breaks_labels(breaks, labels)
2. continuous_scale(c("y", "ymin", "ymax", "yend", "yintercept", "ymin_final", "ymax_final", "lower", "middle", "upper", "y0"), "position_c", identity, name = name, breaks = breaks, n.breaks = n.breaks, minor_breaks = minor_breaks, labels = labels, limits = limits, ...
1. scale_y_continuous(breaks = seq(max(mergedicts$rank)), labels = leftlabels, sec.axis = sec_axis(~., breaks = seq(max(mergedicts$rank)), labels = rightlabels))

any suggestions??

Upvotes: 0

Views: 249

Answers (1)

Allan Cameron

Reputation: 174468

It's hard to follow your example, because your data is not presented in a standard way. I think you mean you have a data frame with four columns like this:

dictfreq <- data.frame(dict = c("apple", "pear", "pineapple"),
                       freq_gov =  c(12, 13, 10),
                       freq_indiv =  c(11, 20, 1),
                       freq_media = c(13, 21, 9))

dictfreq
#>        dict freq_gov freq_indiv freq_media
#> 1     apple       12         11         13
#> 2      pear       13         20         21
#> 3 pineapple       10          1          9

Now if that is the case, your first task will be to pivot this data into long format, and get the rank for each of the three categorical variables:

library(ggplot2)
library(dplyr)
library(tidyr)

df <- pivot_longer(dictfreq, -1) %>% group_by(name) %>% mutate(rank = rank(value))
df
#> # A tibble: 9 x 4
#> # Groups:   name [3]
#>   dict      name       value  rank
#>   <fct>     <chr>      <dbl> <dbl>
#> 1 apple     freq_gov      12     2
#> 2 apple     freq_indiv    11     2
#> 3 apple     freq_media    13     2
#> 4 pear      freq_gov      13     3
#> 5 pear      freq_indiv    20     3
#> 6 pear      freq_media    21     3
#> 7 pineapple freq_gov      10     1
#> 8 pineapple freq_indiv     1     1
#> 9 pineapple freq_media     9     1

Note that for your example, the rank doesn't change for the three dictionary items across your categories: pear is always the highest, followed by apple, followed by pineapple. This doesn't make for a very interesting plot, but let's roll with it for now. You will need to define the labels for the left hand and right hand axis according to which fruit should appear there. You can do that like this:

leftlabels <- df$dict[df$name == "freq_gov"]
leftlabels <- leftlabels[order(df$rank[df$name == "freq_gov"])]

rightlabels <- df$dict[df$name == "freq_media"]
rightlabels <- rightlabels[order(df$rank[df$name == "freq_media"])]

Now you are ready to plot. You will need to include a secondary axis:

ggplot(df, aes(name, rank, color = dict, group = dict)) +
  geom_line(size = 4) +
  geom_point(shape = 21, fill = "white", size = 4) +
  scale_y_continuous(breaks = seq(max(df$rank)), labels = leftlabels,
                     sec.axis = sec_axis(~., breaks = seq(max(df$rank)), 
                                         labels = rightlabels)) +
  scale_x_discrete(expand = c(0.01, 0)) +
  guides(color = guide_none()) +
  coord_cartesian(clip = "off") +
  theme(axis.ticks.length.y = unit(0, "points"))
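A note on the `breaks` and `labels` error from the update to the question: one likely cause (an assumption, since the full data is not shown) is tied frequencies. By default `rank()` averages tied ranks, so the largest rank can be fractional (the posted data contains a rank of 42.5), and `seq(max(rank))` then yields fewer breaks than there are labels. A minimal sketch with made-up data, using `ties.method = "first"` so that each category gets the distinct integer ranks 1..n:

```r
library(dplyr)

# Hypothetical long-format data with tied frequencies (not the asker's real data)
long <- data.frame(
  word  = rep(c("apple", "mandarin", "orange", "pear"), each = 2),
  name  = rep(c("freq_gov", "freq_media"), times = 4),
  value = c(3, 63, 0, 20, 3, 63, 0, 25)
)

# Default ties.method = "average": tied words share a fractional rank
avg <- long %>% group_by(name) %>% mutate(rank = rank(value))

# ties.method = "first": ties are broken by row order, giving ranks 1..n
first <- long %>%
  group_by(name) %>%
  mutate(rank = rank(value, ties.method = "first"))

breaks_avg   <- seq(max(avg$rank))    # max rank is 3.5, so only 3 breaks
breaks_first <- seq(max(first$rank))  # max rank is 4, so 4 breaks

# Labels built as in the answer, one per word in the left-hand category
leftlabels <- first$word[first$name == "freq_gov"]
leftlabels <- leftlabels[order(first$rank[first$name == "freq_gov"])]

length(breaks_avg) == length(leftlabels)    # FALSE: would trigger the error
length(breaks_first) == length(leftlabels)  # TRUE
```

Breaking ties by row order is arbitrary, so tied words end up stacked in whatever order they appear in the data. The approach also assumes every word has a row in both the leftmost and rightmost categories.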

As noted above, the plot for this data is not very interesting, because the ranks never change between categories. However, if we try it with slightly more interesting data:

dictfreq <- data.frame(dict = c("apple", "pear", "pineapple", "banana", "kiwi"),
                       freq_gov =  c(10, 13, 9, 14, 11),
                       freq_indiv =  c(11, 22, 1, 6, 16),
                       freq_media = c(13, 21, 9, 10, 8))

Now we run exactly the same code and we can see this is much closer to the kind of thing you are looking for:

[Image: subway-style rank plot for the five-word example data]
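For reference, the ranking and label steps can be rerun on the five-word data to see which word ends up at each position on the two axes. This sketch only repeats the steps above without drawing the plot:

```r
library(dplyr)
library(tidyr)

dictfreq <- data.frame(dict = c("apple", "pear", "pineapple", "banana", "kiwi"),
                       freq_gov =  c(10, 13, 9, 14, 11),
                       freq_indiv =  c(11, 22, 1, 6, 16),
                       freq_media = c(13, 21, 9, 10, 8))

# Long format, then rank within each category (rank 1 = lowest frequency)
df <- pivot_longer(dictfreq, -1) %>%
  group_by(name) %>%
  mutate(rank = rank(value))

# Left axis follows the first category, right axis the last one
leftlabels <- df$dict[df$name == "freq_gov"]
leftlabels <- as.character(leftlabels[order(df$rank[df$name == "freq_gov"])])

rightlabels <- df$dict[df$name == "freq_media"]
rightlabels <- as.character(rightlabels[order(df$rank[df$name == "freq_media"])])

leftlabels
#> [1] "pineapple" "apple"     "kiwi"      "pear"      "banana"
rightlabels
#> [1] "kiwi"      "pineapple" "banana"    "apple"     "pear"
```

Labels are listed from rank 1 (bottom of the axis) to rank 5 (top), so banana starts at the top under freq_gov and pear ends at the top under freq_media, which is what makes the lines cross.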

Upvotes: 1
