ozhank


R data processing and forecasting

I am a new R user. I have data in the following xls file:

nKPI        December-2012  July-2013  January-2014  July-2014  January-2015  June-2015  January-2016  July-2016
NKPI-03001  0.13           0.25       0.23          0.09       0.07          0.08       0.19          0.14
NKPI-03002  0.23           0.22       0.21          0.16       0.20          0.22       0.32          0.37
NKPI-03003  0.38           0.41       0.44          0.36       0.32          0.28       0.36          0.35
NKPI-03004  0.47           0.37       0.49          0.38       0.41          0.43       0.51          0.54
NKPI-03005  0.24           0.41       0.55          0.43       0.41          0.42       0.54          0.52
NKPI-03006  0.31           0.38       0.39          0.36       0.34          0.40       0.59          0.55
NKPI-03008  0.20           0.21       0.17          0.09       0.10          0.13       0.25          0.29

There are 704 rows of nKPI entries to process.

I need to forecast a value for July 2017 and January 2018 from this data and create a plot for each KPI.

I can read the data into a data frame and drop rows with missing data as follows:

library(gdata)  # provides read.xls()
kpi_df <- read.xls("ochre_kpi.xls", header=TRUE)
# drop rows with no or missing data
kpi_df <- na.omit(kpi_df)

At this stage I get lost. Thanks in advance to anyone who can offer guidance and assistance.


Answers (1)

Nina Sonneborn


It's best to work in tidy data format in R (if you're not sure what this is, get googling). I'm a big fan of the tidy tools in the tidyverse packages. If you're curious why I prefer them, you could read Hadley Wickham's tidy tools manifesto. You can find tutorials online, for example on DataCamp, and the RStudio cheatsheets are a good quick reference (RStudio -> Help -> Cheatsheets). To get started on the analysis above, the following should do.

Note: a package is loaded with library(name_of_package). The first time you use a package, you'll need to install it beforehand with install.packages('name_of_package').

Start with data cleaning. To get data into tidy format:

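library(tidyverse)
# gather() comes from tidyr (loaded by tidyverse), not dplyr;
# -nKPI keeps the identifier column out of the reshape
kpi <- tidyr::gather(kpi_df, key = "date", value = "value", -nKPI)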

This would make your table kpi look like:

nKPI             date                   value
NKPI-03001       December-2012          0.13
NKPI-03001       July-2013              0.25
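
To make the reshaping step concrete, here is a minimal sketch on a two-row stand-in for kpi_df. The values are copied from the question; the object names toy and toy_long are made up for illustration, and check.names = FALSE is only there so the toy headers keep their hyphens.

```r
library(tidyr)

# Toy stand-in for kpi_df: two KPIs and two of the date columns from the question
toy <- data.frame(
  nKPI = c("NKPI-03001", "NKPI-03002"),
  `December-2012` = c(0.13, 0.23),
  `July-2013` = c(0.25, 0.22),
  check.names = FALSE  # keep the hyphenated column names unchanged
)

# Gather every column except nKPI into date/value pairs
toy_long <- gather(toy, key = "date", value = "value", -nKPI)
toy_long
```

Each KPI ends up with one row per date column, which is the shape ggplot() and lm() expect.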

The next thing to do is get R to understand that the date column contains dates. I'd normally recommend lubridate::parse_date_time, which is on page 37 of this documentation. However, since your dates only have a year and a month, you run into the same problem as discussed here. The zoo package handles that case well, so no lubridate this time. The code to fix your dates would be:

library(zoo)
kpi <- kpi %>% mutate(date = zoo::as.Date(zoo::as.yearmon(date, "%B-%Y")))
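
A quick way to check the date conversion on its own is sketched below. It assumes an English locale for the month names. The second half is an assumption about your import step: if the headers were sanitized with make.names() when the file was read, the hyphen becomes a dot and the format string has to change to match. The variable names are made up for illustration.

```r
library(zoo)

# Month-year labels as they appear in the question's header
labels <- c("December-2012", "July-2013", "January-2014")

# as.yearmon() parses month + year; as.Date() then gives the first of the month
dates <- as.Date(as.yearmon(labels, "%B-%Y"))
dates

# Assumption: headers sanitized by make.names() look like "December.2012",
# so the format string uses a dot instead of a hyphen
mangled <- make.names(labels)
dates_mangled <- as.Date(as.yearmon(mangled, "%B.%Y"))
```

If the date column comes out as NA after the mutate() call, looking at unique(kpi$date) before converting shows which separator you actually have.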

Now your data is cleaned and ready to go!

To plot: I'd use ggplot() because of the facet abilities, since you want a plot for each kpi.

# Plot value over time
ggplot(data = kpi, aes(x = date, y = value)) +
    # Type of plot is scatter plot
    geom_point() +
    # Separate plots by the nKPI variable
    facet_grid(~nKPI)
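
With 704 KPIs, a single faceted figure will probably be too crowded to read. One alternative, sketched here on toy data, is to build one plot per KPI and write each to its own file. The object names, the PDF format, and the use of tempdir() as the output folder are assumptions for illustration only.

```r
library(ggplot2)

# Toy long-format data standing in for the cleaned kpi table
kpi_toy <- data.frame(
  nKPI = rep(c("NKPI-03001", "NKPI-03002"), each = 3),
  date = rep(as.Date(c("2012-12-01", "2013-07-01", "2014-01-01")), times = 2),
  value = c(0.13, 0.25, 0.23, 0.23, 0.22, 0.21)
)

# Build one scatter plot per KPI and keep them in a named list
plots <- lapply(split(kpi_toy, kpi_toy$nKPI), function(d) {
  ggplot(d, aes(x = date, y = value)) +
    geom_point() +
    ggtitle(d$nKPI[1])
})

# Write each plot to its own PDF in a temporary directory (assumed location)
out_dir <- tempdir()
for (id in names(plots)) {
  ggsave(file.path(out_dir, paste0(id, ".pdf")), plots[[id]],
         width = 5, height = 3)
}
```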

As for making a prediction, the function to fit a linear regression model is lm(), and predict() then gives you values from the fitted model for new dates. You can read about them by typing ?lm and ?predict.lm into your R console.
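
As a rough sketch of what that could look like for a single KPI, the code below fits a straight line to the NKPI-03001 values from the question and predicts the two requested dates. Treating the trend as linear is an assumption, and with only eight points the intervals will be wide. The variable names (obs, t, fit, pred) are made up for illustration.

```r
# Observations for NKPI-03001, copied from the question
obs <- data.frame(
  date = as.Date(c("2012-12-01", "2013-07-01", "2014-01-01", "2014-07-01",
                   "2015-01-01", "2015-06-01", "2016-01-01", "2016-07-01")),
  value = c(0.13, 0.25, 0.23, 0.09, 0.07, 0.08, 0.19, 0.14)
)

# Use days since 1970-01-01 as a plain numeric predictor
obs$t <- as.numeric(obs$date)

# Fit a straight line of value against time
fit <- lm(value ~ t, data = obs)

# Dates to forecast, as requested in the question
new_obs <- data.frame(date = as.Date(c("2017-07-01", "2018-01-01")))
new_obs$t <- as.numeric(new_obs$date)

# Point forecasts with 95% prediction intervals (columns fit, lwr, upr)
pred <- predict(fit, newdata = new_obs, interval = "prediction")
cbind(new_obs["date"], pred)
```

To cover all KPIs, the same fit could be wrapped in a function and applied to each piece of split(kpi, kpi$nKPI).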

Hope this all helps! Welcome to R!
