Reputation: 135
I'm hoping someone can help me with a data import question. I think it may be an easy fix, but I haven't found the answer. I have a large number of .txt files containing antenna scans that I need to import in a uniform format. The problem is that each file begins with an irregular number of lines of diagnostic data about the antenna before the actual data starts. I need a way to identify, for each file, how many lines of diagnostics there are, so I can pass that number as the skip argument when reading the file with read.delim or something similar.
Here's an example of one of the files I'm talking about:
Power OFF @ 12:05:50 02/15/13
Power ON @ 12:06:03 02/15/13
Reader #1 12:06:03 02/15/13
Reader #2 12:06:03 02/15/13
Battery Voltage = 13.35 @ 13:00:00 02/15/13
Battery Voltage = 13.42 @ 14:00:00 02/15/13
Battery Voltage = 13.32 @ 15:00:00 02/15/13
Battery Voltage = 13.55 @ 16:00:00 02/15/13
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Upvotes: 2
Views: 275
Reputation: 8846
A slight variation: this returns every line that begins with Reader and contains 7 elements separated by spaces. I noticed that the first two Reader lines were shorter; if that isn't always the case, this will of course not work.
antenna0 <- readLines("antenna.txt")
# keep only the lines that start with "Reader"
antenna0 <- antenna0[grep("^Reader", antenna0)]
# split each line on single spaces
antenna <- strsplit(antenna0, " ")
# keep the 7-element lines and bind them into a data frame
data.frame(do.call(rbind, antenna[sapply(antenna, length) == 7]))
# X1 X2 X3 X4 X5 X6 X7
#1 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#2 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#3 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#4 Reader #2 02:57:40 02/17/13 LA 900 226000012999
I suppose you could do this directly in grep using more advanced regular expressions, but I don't have the chops.
edit: a slight improvement relying more on grep:
antenna0 <- readLines("antenna.txt")
antenna1 <- antenna0[grep("^Reader.*\\s{4,}", antenna0)]
antenna2 <- do.call(rbind, strsplit(antenna1, " "))
data.frame(antenna2)
# X1 X2 X3 X4 X5 X6 X7
#1 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#2 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#3 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#4 Reader #2 02:57:40 02/17/13 LA 900 226000012999
Short explainer: \\s{4,} matches a run of four or more ({4,}) whitespace characters (\\s); ^Reader matches only strings that start with the letter sequence Reader; and .* combines the two patterns, functioning as an AND operator.
Upvotes: 0
Reputation: 43334
read.table
If you use readLines to read in the text line by line, you can use grep to find the highest line number matching "Battery Voltage" and pass that number to skip.
read.table('file.txt',
           skip = max(grep('Battery Voltage', readLines('file.txt'))),
           # set the comment-delimiting character to anything besides "#"
           comment.char = '')
## V1 V2 V3 V4 V5 V6 V7
## 1 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 3 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
Note that further cleaning (combining columns, formatting dates) is necessary.
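For instance, that cleanup might look something like this (a minimal sketch; the V3/V4 column names and their contents are assumed from the output above):

```r
# Hypothetical cleanup after read.table: paste the date (V4) and time (V3)
# columns together and parse them into a single POSIXct timestamp.
x <- data.frame(V1 = "Reader", V2 = "#2",
                V3 = "02:57:40", V4 = "02/17/13",
                V5 = "LA", V6 = 900L, V7 = 2.26e+11,
                stringsAsFactors = FALSE)
x$timestamp <- as.POSIXct(paste(x$V4, x$V3),
                          format = "%m/%d/%y %H:%M:%S", tz = "UTC")
format(x$timestamp, "%Y-%m-%d %H:%M:%S")  # "2013-02-17 02:57:40"
```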
read.fwf
It may make more sense to use read.fwf (fixed-width files) if the column widths are consistent. You'll need na.omit, complete.cases, or some other means of getting rid of blank lines, since read.fwf doesn't accept a blank.lines.skip parameter the way read.table and its variants do:
na.omit(read.fwf('file.txt',
                 widths = c(9, -1, 17, -1, 2, -1, 3, -1, 12),
                 skip = max(grep('Battery Voltage', readLines('file.txt'))),
                 comment.char = ''))
## V1 V2 V3 V4 V5
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 6 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 8 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
However, counting characters to figure out column widths is a pain (and error-prone).
readr::read_fwf
The readr package makes dealing with fixed-width files slightly less annoying and gives useful warnings when things aren't parsed ideally. It also provides parameters to parse dates and datetimes as you read in your data, which is convenient:
library(readr)
df <- read_fwf(file.txt,
fwf_widths(c(9, 18, 3, 4, NA)),
col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'),'c', 'i', 'd'),
skip = max(grep('Battery Voltage', readLines(file.txt))))
df <- df[complete.cases(df), ]
# or df <- na.omit(df)
# or if some NAs are possible, more robust:
# df <- df[colSums(!apply(df, 1, is.na)) > 0, ]
df
## # A tibble: 4 x 5
## X1 X2 X3 X4 X5
## <chr> <time> <chr> <int> <dbl>
## 1 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
Note the nicely parsed datetime and the slightly simpler input method for column widths (you can let it guess with fwf_empty, which works well if you have column names).
If you've got your widths and column types correct, anything incorrect will be entered as NA, so if you use na.omit, you may be able to avoid a skip parameter altogether:
na.omit(read_fwf('file.txt',
                 fwf_widths(c(9, 18, 3, 4, 13)),
                 col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'), 'c', 'i', 'd')))
## # A tibble: 4 x 5
## X1 X2 X3 X4 X5
## <chr> <time> <chr> <int> <dbl>
## 1 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
This approach is somewhat precarious, though, so should only be used if you can verify that it's working properly.
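One way to verify it, sketched here on an in-memory sample of the file (the regular expression describing a data line is an assumption, not something from readr): count the raw lines that look like data rows and compare that count against nrow() of the parsed, NA-filtered result.

```r
# Count lines that look like data rows: "Reader", then anything, then a
# 12-digit tag id at the end of the line. This count should match the
# number of rows that survive na.omit() on the read_fwf result.
lines <- c("Power ON @ 12:06:03 02/15/13",
           "Battery Voltage = 13.35 @ 13:00:00 02/15/13",
           "Reader #2 02:57:40 02/17/13 LA 900 226000012999",
           "Reader #2 02:57:40 02/17/13 LA 900 226000012999")
n_data <- length(grep("^Reader .* [0-9]{12}$", lines))
n_data  # 2
```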
Upvotes: 2
Reputation: 293
You could read the file as a block of text and use grep to identify the lines you want to get rid of. Here, I stored your block of text in test.txt. Say your header goes all the way through the Battery Voltage part: you could first identify the line numbers that contain Battery and then find the last instance of it. That will be the number of lines to skip.
con = file('test.txt', 'r')
text = readLines(con)
close(con)
lines_to_skip = max(grep('Battery',text))
You should then read your data just fine.
> x = read.table('test.txt', skip=lines_to_skip, sep=' ', comment.char='')
> x
V1 V2 V3 V4 V5 V6 V7
1 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
3 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
Upvotes: 0
Reputation: 326
Are you always looking for the first line after the last line with "Battery Voltage"? If so, try this:
the.file <- readLines("C:\\Users\\myfile.txt")
row.to.begin.skip.at <- tail(grep("Battery Voltage", the.file), 1)
Then set skip=row.to.begin.skip.at in your preferred reading function. The grep() function searches each line for the phrase "Battery Voltage" and returns the row numbers of all lines containing it; tail(x, 1) then returns the last of those row numbers.
Upvotes: 0