amigo

Reputation: 15

Importing a large mixed CSV with time series into R

I'm a beginner with R and have been combing the site for answers, but have had no luck with my specific problem.

Basically I need to import about 15 CSV files and work on each of them (plotting and analysis). They each contain thousands of entries and are time series with 15-20 samples each. The format is such that there are dates at the top, as well as other column names on the top row, both of which I need to keep, and then the numeric values, which are percentages and need to be retained through the import.

So far I have worked out how to import, process and clean, but everything gets stored as characters and I lose significant figures.

I would really appreciate some advice on workflow and suitable import/processing functions, as I need to keep the dates and the first column of names (which I can store as index vectors or data frames), and then work on the data itself (so that I can plot series individually, or in summarised combinations).

Any advice (including good workflow practices for handling this systematically) would be greatly appreciated. I have been using "readr" and the base functions, but each time I run the first import I can't get past the problem of the data being converted to characters, which I need to avoid while also keeping the dates.

Is there a way around this by some wondrous R techniques, or do I need to work on pre-formatting these many atrocious CSV files first?

[Image: an example of what I am dealing with, multiplied many times in both directions]

Upvotes: 0

Views: 93

Answers (1)

r2evans

Reputation: 160447

If your files are all formatted the same, then you can "skip" rows to get things right. In your image example, I'm inferring the data to look something like this, and the code to read in one file is simply:

dat <- read.csv(text='NAME,% Outstanding,% Outstanding,% Outstanding
,31/12/2018,31/12/2017,31/12/2015
Name01,0.1,0.2,0.3
Name02,0.4,0.5,0.6',
header=TRUE, stringsAsFactors=FALSE, skip=1)
dat
#        X X31.12.2018 X31.12.2017 X31.12.2015
# 1 Name01         0.1         0.2         0.3
# 2 Name02         0.4         0.5         0.6

I chose to skip the top row, since most of the names in it were redundant and not really helpful. Notice that the remaining names were munged a little; see ?make.names for the column naming rules that R enforces.
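
For instance, the munged date headers above come straight out of make.names():

make.names(c("31/12/2018", "31/12/2017", "31/12/2015"))
# [1] "X31.12.2018" "X31.12.2017" "X31.12.2015"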

From here, though it's partly a matter of preference, I strongly recommend converting from "wide" to "tall" format. At least two reasons: (1) the column names are actually data, in that you may want/need to know that one batch of data is "1 year apart" from another; (2) many tools such as ggplot2 and even base graphics (with their formula methods) prefer the tall format.

I'll use some functions from the tidyverse, and first fix the generic name of the first column:

colnames(dat)[1] <- "Name"

library(dplyr)
library(tidyr)

First, a note about wide-to-tall:

gather(dat, date, pct, -Name)
#     Name        date pct
# 1 Name01 X31.12.2018 0.1
# 2 Name02 X31.12.2018 0.4
# 3 Name01 X31.12.2017 0.2
# 4 Name02 X31.12.2017 0.5
# 5 Name01 X31.12.2015 0.3
# 6 Name02 X31.12.2015 0.6

This brings the "date" into the data-realm. The fact that the dates are munged a little (losslessly) is fine; we can sort that out in the next step:

gather(dat, date, pct, -Name) %>%
  mutate(
    date = as.Date(date, format = "X%d.%m.%Y")
  )
#     Name       date pct
# 1 Name01 2018-12-31 0.1
# 2 Name02 2018-12-31 0.4
# 3 Name01 2017-12-31 0.2
# 4 Name02 2017-12-31 0.5
# 5 Name01 2015-12-31 0.3
# 6 Name02 2015-12-31 0.6
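
(Side note: in newer versions of tidyr, gather() has been superseded by pivot_longer(); a roughly equivalent sketch, assuming the same dat as above:)

pivot_longer(dat, -Name, names_to = "date", values_to = "pct") %>%
  mutate(date = as.Date(date, format = "X%d.%m.%Y"))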

From here, doing this for multiple files should be a matter of:

# full.names=TRUE so that read.csv gets full paths, not just file names
filelist <- list.files("/path/to/mydir/", pattern="\\.csv$", full.names=TRUE)
alldata <- lapply(filelist, read.csv, header=TRUE, stringsAsFactors=FALSE, skip=1)
alldata2 <- lapply(alldata, function(dat) {
  colnames(dat)[1] <- "Name"   # same rename as above
  gather(dat, date, pct, -Name) %>%
    mutate(date = as.Date(date, format = "X%d.%m.%Y"))
})
combineddata <- do.call(rbind.data.frame, alldata2)

(Assuming all files have the same 2-row header.)
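
Once the data is tall, plotting with ggplot2 (one of the reasons mentioned above) is straightforward; a minimal sketch, assuming combineddata has the Name/date/pct columns built above:

library(ggplot2)
ggplot(combineddata, aes(x = date, y = pct, colour = Name)) +
  geom_line() +
  geom_point()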

Upvotes: 1
