Reputation: 301
I am trying to make a loop that reads in multiple CSV files that all have the same type of air temperature data. However there are rows that I want to skip above the data. These are "alarms" in the dataset. Each file may have a different amount of alarms, thus a different amount of rows to skip. See below:
-------------First CSV file---------------
Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2
-------------Second CSV file---------------
Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2
How can I get the index of Date
, Temp
to tell read.csv
to skip to that row?
for (i in 1:length(csv.list)) {
df = read.csv(csv.list[i], header = T, skip=????????)
}
Upvotes: 7
Views: 1413
Reputation: 969
I think there is a much simpler approach, which is reading the files as they are, with a few options.
df = read.csv('weather-logger2.csv', colClasses = c('character', 'numeric'),
col.names = c('Date', 'Temp'),
na.strings = 'Temp', fill = TRUE, stringsAsFactors = FALSE)
The first few lines contain strings that are not dates, which will give NA when you convert them to dates (you'll have to convert this way anyway for further processing), and then you just subset getting rid of the missing values:
Date Temp
1 Alarm 1 NA
2 Alarm 2 NA
3 Alarm 3 NA
4 Alarm 4 NA
5 Alarm 5 NA
6 Alarm 6 NA
7 Alarm 7 NA
8 Date NA
9 01/01/2011 -1.2
10 01/02/2011 -1.3
11 01/03/2011 -1.1
12 01/04/2011 -1.2
df$Date = as.Date(df$Date, '%d/%m/%Y')
df = subset(df, !is.na(Date))
to get:
Date Temp
9 2011-01-01 -1.2
10 2011-02-01 -1.3
11 2011-03-01 -1.1
12 2011-04-01 -1.2
Upvotes: 0
Reputation: 193517
fread
from the "data.table" package might be useful for you since it is pretty good at auto-detecting "junk" header rows. Here's an example:
First, create two sample csv files
cat("Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2
", file = "socsv1.csv", sep = "\n")
cat("Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2", file = "socsv2.csv", sep = "\n")
## Check that they were created
list.files(pattern = "socsv")
# [1] "socsv1.csv" "socsv2.csv"
Now, just use fread
and specify the sep
instead of letting fread
guess.
library(data.table)
lapply(list.files(pattern = "socsv"), fread, sep = ",")
# [[1]]
# Date Temp
# 1: 01/01/2011 -1.2
# 2: 01/02/2011 -1.3
# 3: 01/03/2011 -1.1
# 4: 01/04/2011 -1.2
#
# [[2]]
# Date Temp
# 1: 01/01/2011 -1.2
# 2: 01/02/2011 -1.3
# 3: 01/03/2011 -1.1
# 4: 01/04/2011 -1.2
Upvotes: 3
Reputation: 25854
You could add a couple of lines prior to your read.table
in your loop
Use readLines
to read in the data
r <- readLines(textConnection("Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2"))
[but without the textConnection for you ie r <- readLines("yourcsv")
]
Find the row number where the actual headers begin - using grep
dt <- grep("Date",r)
Read in your data - skipping the lines prior to the headers
read.table(text=r , header=TRUE, sep="," , skip = dt-1)
So to read in your multiple csv files - these will be stored in a list of data,frames
df.lst <- lapply(csv.list , function(i) {
r <- readLines(i)
dt <- grep("Date",r)
read.table(text=r , header=TRUE, sep="," , skip = dt-1)
})
Upvotes: 6
Reputation: 27388
count.fields
can identify the first line of each file that has two comma-separated fields. You can then use this to specify skip
. For example:
writeLines('Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2', f <- tempfile())
writeLines('Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2', f2 <- tempfile())
for (x in c(f, f2)) {
ind <- match(2, count.fields(x, ','))
df <- read.csv(x, header = T, skip=ind - 1)
print(df)
}
# Date Temp
# 1 01/01/2011 -1.2
# 2 01/02/2011 -1.3
# 3 01/03/2011 -1.1
# 4 01/04/2011 -1.2
# Date Temp
# 1 01/01/2011 -1.2
# 2 01/02/2011 -1.3
# 3 01/03/2011 -1.1
# 4 01/04/2011 -1.2
Upvotes: 1