Karan Chaudhary

Reputation: 21

ggplot: Plotting timeseries data with missing values

I have been trying to plot one column of a data frame against another. The first column, "Time", holds daily dates (format YYYY-MM-DD), and the second column, "data1", holds the precipitation magnitude as a numeric value.

The data come from an Excel file, "St Lucia3", which has a total of 11598 data points and stores daily precipitation data from 1981 to 2018 in two columns:

  1. YearMonthDay (format- "YYYYMMDD", example "19810501")

  2. Rainfall (mm)

The code for importing data into R:

library(readxl)
StLucia <- read_excel("C:/Users/hp/Desktop/St Lucia3.xlsx")

The code for time data "Time" :

Time <- as.Date(as.character(StLucia$YearMonthDay), format= "%Y%m%d")

The code for precipitation data "data1" :

library("imputeTS")
data1 <- na_ma(StLucia$`Rainfall (mm)`, k = 4, weighting = "exponential")

The code for data frame "Precip1" :

Precip1 <- data.frame(Time, data1, check.rows=TRUE)

The code for ggplot is:

library(ggplot2)
ggplot(data = Precip1, mapping = aes(x = Time, y = data1)) + geom_line()

Using ggplot to plot "data1" against "Time" gives this result: [Link to the Rplot between "data1" and "Time"]

Can someone please explain why there is an unusual kink at the right end of the graph, even though there are no such values in the column "data1"?

The plot of "data1" against its index looks like this: [Link for Rplot for "data1" against its index]

The code for this plot is:

plot(data1, type = "l")

Any help would be highly appreciated. Thanks!

Upvotes: 1

Views: 9763

Answers (2)

Gregor Thomas

Reputation: 145755

Here is a reproducible example - change the names to match your data.

library(ggplot2)

# create sample data
set.seed(47)
dd = data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))

# demonstrate problem
ggplot(dd, aes(t, y)) +
    geom_point() +
    geom_line()


The easiest solution, as Tung points out, is to use a more appropriate geom, like geom_col:

ggplot(dd, aes(t, y)) +
    geom_col()


If you really want to use lines, you should fill in the missing dates with NA for rainfall:

# calculate all days
all_days = data.frame(t = seq.Date(from = min(dd$t), to = max(dd$t), by = "day"))
# join to original data
library(dplyr)
dd_complete = left_join(all_days, dd, by = "t")

# ggplot won't connect lines across missing values
ggplot(dd_complete, aes(t, y)) +
    geom_point() +
    geom_line()


Alternatively, you could replace the missing values with 0s so that the line runs along the axis. I think it's nicer not to plot the line at all, though: a break implies no data, whereas 0s imply no rainfall.
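For completeness, here is a minimal sketch of the zero-filling alternative on the same sample data. It uses base R's merge() in place of left_join(), and the name dd_zero is made up for illustration. It assumes that a missing day really means no rainfall, which may not hold for your data.

```r
library(ggplot2)

# same sample data as above: six consecutive days, a gap, then three more days
set.seed(47)
dd <- data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))

# one row per calendar day; days absent from dd get y = NA after the merge
all_days <- data.frame(t = seq(min(dd$t), max(dd$t), by = "day"))
dd_zero <- merge(all_days, dd, by = "t", all.x = TRUE)

# assumption: a missing day is treated as a day with zero rainfall
dd_zero$y[is.na(dd_zero$y)] <- 0

# the line now drops to the axis across the gap instead of jumping over it
p <- ggplot(dd_zero, aes(t, y)) +
    geom_line()
```

Printing p draws the plot. Compare it with the NA version above to see the difference between "no data" and "no rain".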

Upvotes: 1

Chabo

Reputation: 3000

With pad() from the padr package we can add rows for the missing dates and give them an NA value, so that nothing is plotted in the region of missing data.

library(padr)
library(zoo)

YearMonthDay <- c(19810501, 19810502, 19810504, 19810505)
Data <- c(1, 2, 3, 4)

StLucia <- data.frame(YearMonthDay, Data)

StLucia$YearMonthDay <- as.Date(as.character(StLucia$YearMonthDay), format = "%Y%m%d")

> StLucia
  YearMonthDay Data
1   1981-05-01    1
2   1981-05-02    2
3   1981-05-04    3
4   1981-05-05    4

Note: 1981-05-03 is missing, yet rows 2 and 3 sit next to each other in the data frame. That is why a plot against the index shows no gap, while a plot against the dates does.
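A quick way to check a date column for gaps like this one is to look at the differences between consecutive dates. This is only a sketch in base R on the same four dates; the names dates, gaps and n_missing are made up for illustration, and it assumes the dates are already sorted.

```r
dates <- as.Date(c("19810501", "19810502", "19810504", "19810505"),
                 format = "%Y%m%d")

# number of days between consecutive observations
gaps <- as.numeric(diff(dates))

# any step larger than 1 marks a gap; this shows the date just before each gap
dates[which(gaps > 1)]

# total number of calendar days that are missing
n_missing <- sum(gaps - 1)
```

On the real data the same check would run on the Time vector from the question.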

So let's add the missing date:

StLucia <- pad(StLucia, interval = "day")

> StLucia
   YearMonthDay Data
 1   1981-05-01    1
 2   1981-05-02    2
 3   1981-05-03   NA
 4   1981-05-04    3
 5   1981-05-05    4

 plot(StLucia, type = "l")


If you want to fill in those NA values, use na.locf() from the zoo package.
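As a rough illustration, here is how that looks on the padded values from above. The vector rain is a stand-in for the Data column, and the names rain_locf and rain_interp are made up. na.approx(), also from zoo, is shown as an alternative that interpolates linearly. Whether either fill is sensible for rainfall is a judgment call.

```r
library(zoo)

# stand-in for the padded Data column, with one missing day
rain <- c(1, 2, NA, 3, 4)

# last observation carried forward: the NA takes the previous value
rain_locf <- na.locf(rain)

# alternative: linear interpolation between the two neighbouring values
rain_interp <- na.approx(rain)
```

Note that na.locf() cannot fill a leading NA, since there is no earlier value to carry forward.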

Upvotes: 3
