bevingtona

Reputation: 301

Read CSV with variable rows to skip, bulk

I am trying to make a loop that reads in multiple CSV files that all contain the same type of air temperature data. However, there are rows above the data that I want to skip; these are "alarms" in the dataset. Each file may have a different number of alarms, and thus a different number of rows to skip. See below:

-------------First CSV file---------------
Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2

-------------Second CSV file---------------
Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2

How can I get the row index of the Date, Temp header so that I can tell read.csv how many rows to skip?

for (i in 1:length(csv.list))  {
  df = read.csv(csv.list[i], header = T, skip=????????)
}

Upvotes: 7

Views: 1413

Answers (4)

user14382

Reputation: 969

I think there is a much simpler approach: read the files as they are, with a few extra options.

df = read.csv('weather-logger2.csv', colClasses = c('character', 'numeric'),
              col.names = c('Date', 'Temp'),
              na.strings = 'Temp', fill = TRUE, stringsAsFactors = FALSE)

The first few rows contain strings that are not dates, which will give NA when you convert them to dates (you'll have to do that conversion anyway for further processing); you can then subset to drop the missing values. Reading the file gives:

     Date Temp
1     Alarm 1   NA
2     Alarm 2   NA
3     Alarm 3   NA
4     Alarm 4   NA
5     Alarm 5   NA
6     Alarm 6   NA
7     Alarm 7   NA
8        Date   NA
9  01/01/2011 -1.2
10 01/02/2011 -1.3
11 01/03/2011 -1.1
12 01/04/2011 -1.2

df$Date = as.Date(df$Date, '%d/%m/%Y')
df = subset(df, !is.na(Date))

to get:

         Date Temp
9  2011-01-01 -1.2
10 2011-02-01 -1.3
11 2011-03-01 -1.1
12 2011-04-01 -1.2
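
The same approach can be wrapped in a small function and applied over your file list. A minimal sketch, assuming your paths are stored in csv.list as in the question, and assuming the dates really are day-first as above (use '%m/%d/%Y' if they are month/day/year):

read_logger <- function(file) {
  # Read everything; alarm rows and the header row end up with NA in Temp
  df <- read.csv(file, colClasses = c('character', 'numeric'),
                 col.names = c('Date', 'Temp'),
                 na.strings = 'Temp', fill = TRUE, stringsAsFactors = FALSE)
  # Non-date strings (the alarms and "Date") become NA on conversion
  df$Date <- as.Date(df$Date, '%d/%m/%Y')
  subset(df, !is.na(Date))
}

df.list <- lapply(csv.list, read_logger)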

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

fread from the "data.table" package might be useful for you since it is pretty good at auto-detecting "junk" header rows. Here's an example:

First, create two sample CSV files:

cat("Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2
", file = "socsv1.csv", sep = "\n")

cat("Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2", file = "socsv2.csv", sep = "\n")

## Check that they were created
list.files(pattern = "socsv")
# [1] "socsv1.csv" "socsv2.csv"

Now, just use fread and specify the sep instead of letting fread guess.

library(data.table)
lapply(list.files(pattern = "socsv"), fread, sep = ",")
# [[1]]
#          Date  Temp
# 1: 01/01/2011  -1.2
# 2: 01/02/2011  -1.3
# 3: 01/03/2011  -1.1
# 4: 01/04/2011  -1.2
# 
# [[2]]
#          Date  Temp
# 1: 01/01/2011  -1.2
# 2: 01/02/2011  -1.3
# 3: 01/03/2011  -1.1
# 4: 01/04/2011  -1.2

Upvotes: 3

user20650

Reputation: 25854

You could add a couple of lines prior to the read.table call in your loop.

Use readLines to read in the data

r <- readLines(textConnection("Logger 001
               Alarm 1
               Alarm 2
               Alarm 3
               Alarm 4
               Date, Temp
               01/01/2011, -1.2
               01/02/2011, -1.3
               01/03/2011, -1.1
               01/04/2011, -1.2"))

(In your case you won't need the textConnection; just use r <- readLines("yourcsv").)

Find the row number where the actual headers begin - using grep

dt <- grep("Date",r)

Read in your data - skipping the lines prior to the headers

read.table(text = r, header = TRUE, sep = ",", skip = dt - 1)


So, to read in your multiple CSV files (the results will be stored in a list of data.frames):

df.lst <- lapply(csv.list, function(i) {
  r <- readLines(i)
  dt <- grep("Date", r)
  read.table(text = r, header = TRUE, sep = ",", skip = dt - 1)
})
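
If you then want a single data frame rather than a list, you can bind the pieces together afterwards; a minimal sketch, assuming the columns are identical across files:

# Name the list by file so the row names show where each row came from
names(df.lst) <- csv.list
df.all <- do.call(rbind, df.lst)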

Upvotes: 6

jbaums
jbaums

Reputation: 27388

count.fields can identify the first line of each file that has two comma-separated fields. You can then use this to specify skip. For example:

writeLines('Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2', f <- tempfile())

writeLines('Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2', f2 <- tempfile())

for (x in c(f, f2))  {
  ind <- match(2, count.fields(x, ','))
  df <- read.csv(x, header = T, skip=ind - 1)
  print(df)
}

#         Date Temp
# 1 01/01/2011 -1.2
# 2 01/02/2011 -1.3
# 3 01/03/2011 -1.1
# 4 01/04/2011 -1.2
#         Date Temp
# 1 01/01/2011 -1.2
# 2 01/02/2011 -1.3
# 3 01/03/2011 -1.1
# 4 01/04/2011 -1.2
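
The same idea slots straight into your original loop; a sketch, assuming csv.list holds your file paths as in the question, and using lapply so each file's data frame is kept rather than overwritten:

df.list <- lapply(csv.list, function(x) {
  # First line with exactly two comma-separated fields is the "Date, Temp" header
  ind <- match(2, count.fields(x, ','))
  read.csv(x, header = TRUE, skip = ind - 1)
})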

Upvotes: 1
