bmc

Reputation: 857

r - ggplot multiple line graphs for each unique instance over time

The Problem

I am plotting many line graphs on top of one another, but I only want to color about 10 of them once they are all drawn together, so I can visualize how my 'targets' traveled over time while still seeing the mass of other lines behind them. For example, with 100 line graphs over time, I want to color 5 or 10 of them so I can discuss their trends relative to the other 90 or so grayscale ones.

The following post has a pretty good image that I want to replicate, but with slightly more meat on the bones. I want MANY grayscale lines behind those 3, with those 3 as my highlighted cities in the foreground, so to speak.

My original data was in the following form:

# The unique identifier is a City-State combo, 
# there can be the same cities in 1 state or many. 
# Each state's year ranges from 1:35, but may not have
# all of the values available to us, but some are complete.

r1 <- c("city1" , "state1" , "year" , "population" , rnorm(11) , "2")
r2 <- c("city1" , "state2" , "year" , "population" , rnorm(11) , "3")
r3 <- c("city2" , "state1" , "year" , "population" , rnorm(11) , "2")
r4 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "1")
r5 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "7")

df <- data.frame(matrix(nrow = 5, ncol = 16))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5

names(df) <- c("City", "State", "Year", "Population", 1:11, "Cluster")

head(df)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# City | State | Year | Population  | ... 11 Variables ... | Cluster    #
# ----------------------------------------------------------------------#
# Each row is a city instance with these features ...                   #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

But I thought it might be better to view the data differently, so I also have it in the following format. I am not sure which is better for this problem.

cols <- 1:35
rows <- c("unique_city1", "unique_city2","unique_city3","unique_city4","unique_city5")
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)

df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5

names(df) <- cols
row.names(df) <- rows

head(df)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
#                       Year1 Year2 .......... Year 35  #
# UniqueCityState1       VAL    NA  ..........  VAL     #
# UniqueCityState2       VAL    VAL ..........  NA      #
#         .                                             #
#         .                                             #
#         .                                             #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

Prior Attempts

I have tried using melt to get the data into a format that ggplot will accept so I can plot each of these cities over time, but nothing has worked. I have also tried writing my own functions to loop through each unique city-state combination and stack ggplots, an approach I found a fair amount of material on, but still with no success. I am not sure how to find each of these unique city-state pairs and plot them over time using their cluster value, or any numeric value for that matter. Or maybe what I am seeking is not possible; I am not sure.

Thoughts?

EDIT: More information about data structure

> head(df)
        city state year population    stat1 stat2 stat3 stat4 stat5
1       BESSEMER     1    1      31509 0.3808436            0 0.63473928   2.8563268    9.5528262
2     BIRMINGHAM     1    1     282081 0.3119671            0 0.97489728   6.0266377    9.1321287
3 MOUNTAIN BROOK     1    1      18221 0.0000000            0 0.05488173   0.2744086    0.4390538
4      FAIRFIELD     1    1      12978 0.1541069            0 0.46232085   3.0050855    9.8628448
5     GARDENDALE     1    1       7828 0.2554931            0 0.00000000   0.7664793    1.2774655
6          LEEDS     1    1       7865 0.2542912            0 0.12714558   1.5257470   13.3502861
  stat6 stat6 stat7 stat8 stat9 cluster
1     26.976419     53.54026  5.712654                    0               0.2856327       9
2     35.670605     65.49183 11.982374                    0               0.4963113       9
3      6.311399     21.40387  1.426925                    0               0.1097635       3
4     21.266759     68.11527 11.480968                    0               1.0787487       9
5      6.770567     23.24987  3.960143                    0               0.0000000       3
6     24.157661     39.79657  4.450095                    0               1.5257470      15
    agg
1  99.93970
2 130.08675
3  30.02031
4 115.42611
5  36.28002
6  85.18754

Ultimately I need it in the form of unique cities as row.names and 1:35 as col.names, with each cell holding agg if that year was present or NA if it wasn't. I am sure this is possible, but I can't find a good solution, and my current approach is unstable.
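For reference, one possible way to reach that layout is sketched below. It assumes the long table has the city, state, year and agg columns shown in head(df); the toy values, the id column, and the "_" separator are made up for illustration and are not part of the original data.

```r
library(dplyr)
library(tidyr)

# Toy long-format data with the same key columns as head(df) above;
# the values are invented for illustration only
long <- data.frame(
  city  = c("BESSEMER", "BESSEMER", "LEEDS", "LEEDS", "LEEDS"),
  state = c(1L, 1L, 1L, 1L, 1L),
  year  = c(1L, 3L, 1L, 2L, 35L),
  agg   = c(99.9, 101.2, 85.2, 80.4, 77.0)
)

wide <- long %>%
  # Hypothetical identifier: one value per city-state pair
  mutate(id = paste(city, state, sep = "_")) %>%
  select(id, year, agg) %>%
  # Add an explicit NA row for every id-year pair that is absent
  complete(id, year = 1:35) %>%
  # Spread the years into columns 1..35
  pivot_wider(names_from = year, values_from = agg) %>%
  as.data.frame()

# Move the identifier into the row names
row.names(wide) <- wide$id
wide$id <- NULL
```

Here wide has one row per city-state pair and 35 year columns, with NA wherever a year was not reported.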

Upvotes: 2

Views: 1858

Answers (1)

www

Reputation: 39174

If I understand your question correctly, you want to plot all the lines in one color and then draw a few of them in distinct colors on top. You can do this in ggplot2 by calling geom_line twice on two data frames. The first call plots all the city data without mapping color to anything. The second call plots only the subset containing your target cities and maps color to the city. You will need to reshape your original data frame into long format and subset it for the target cities. In the following code I used tidyr and dplyr to process the data frame.

### Set.seed to improve reproducibility
set.seed(123)

### Load package
library(tidyr)
library(dplyr)
library(ggplot2)

### Prepare example data frame 
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)

df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5 

names(df) <- 1:35

df <- df %>% mutate(City = 1:5)

### Reorganize the data for plotting
df2 <- df %>%
  gather(Year, Value, -City) %>%
  mutate(Year = as.numeric(Year))

The gather function takes df as its first argument. It creates a key column called Year, which stores the year numbers; these come from the names of every column in df except City. It also creates a column called Value, which stores the numeric values from those same columns. The City column is not part of this reshaping, so -City tells gather "do not transform the data from the City column". Because column names are stored as text, the final mutate converts Year back to numeric.
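In tidyr 1.0.0 and later, gather is superseded by pivot_longer. As a rough sketch of the same reshaping step, assuming a wide data frame shaped like df above (the names wide and long are placeholders for this example):

```r
library(dplyr)
library(tidyr)

# Small stand-in for the wide data frame: one row per city, one column per year
set.seed(123)
wide <- as.data.frame(matrix(rnorm(5 * 35), nrow = 5))
names(wide) <- 1:35
wide$City <- 1:5

# pivot_longer() plays the role of gather(): every column except City
# becomes a (Year, Value) pair
long <- wide %>%
  pivot_longer(cols = -City, names_to = "Year", values_to = "Value") %>%
  # Column names come through as text, so convert them back to numbers
  mutate(Year = as.numeric(Year))
```

The rows come out in a different order than with gather (city by city rather than year by year), which does not matter for plotting.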

### Subset df2, select the city of interest
df3 <- df2 %>%
  # In this example, assuming that City 2 and City 3 are of interest
  filter(City %in% c(2, 3))

### Plot the data
ggplot(data = df2, aes(x = Year, y = Value, group = factor(City))) +
  # Plot all city data here in gray lines
  geom_line(size = 1, color = "gray") +
  # Plot target city data with colors
  geom_line(data = df3, 
            aes(x = Year, y = Value, group = City, color = factor(City)),
            size = 2) 

The resulting plot can be seen here: https://dl.dropboxusercontent.com/u/23652366/example_plot.png
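If your real data is already in long form, like the table in the question's EDIT (city, state, year, agg), the same two-layer idea may work without reshaping at all. The following is only a sketch under that assumption: the toy values, the id column, and the targets vector are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Toy data shaped like the long table in the question's EDIT;
# the values are invented for illustration only
set.seed(42)
cities <- expand.grid(city = paste0("CITY", 1:20), state = 1:2,
                      year = 1:35, stringsAsFactors = FALSE) %>%
  mutate(agg = rnorm(n(), mean = 80, sd = 20)) %>%
  # Drop roughly 10% of rows to mimic years that were not reported
  filter(runif(n()) > 0.1) %>%
  # One identifier per city-state pair, since city names repeat across states
  mutate(id = paste(city, state, sep = "_"))

# Hypothetical targets to highlight
targets <- c("CITY1_1", "CITY7_2", "CITY12_1")
highlight <- filter(cities, id %in% targets)

p <- ggplot(cities, aes(x = year, y = agg, group = id)) +
  # Background layer: every city-state pair in light gray
  geom_line(color = "gray80") +
  # Foreground layer: only the targets, colored by identifier
  geom_line(data = highlight, aes(color = id)) +
  labs(color = "City")

print(p)
```

Because the highlighted layer is added last, it is drawn on top of the gray lines. Note that geom_line connects points across missing years; if gaps should appear as breaks, the missing years would need explicit NA rows (for example via tidyr::complete).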

Upvotes: 3
