Reputation: 135
I'm hoping someone can help me with a data import question. I think it may be an easy fix, but I haven't found the answer. I have a large number of .txt files containing antenna scans that I need to import in a uniform format. The problem is that each file begins with an irregular number of lines of diagnostic data about the antenna before the actual data starts. I need a way to identify, for each file, how many lines of diagnostics there are, so I can pass that number as the skip argument when reading the file with read.delim or something similar.
Here's an example of one of the files I'm talking about:
Power OFF @ 12:05:50 02/15/13
Power ON @ 12:06:03 02/15/13
Reader #1 12:06:03 02/15/13
Reader #2 12:06:03 02/15/13
Battery Voltage = 13.35 @ 13:00:00 02/15/13
Battery Voltage = 13.42 @ 14:00:00 02/15/13
Battery Voltage = 13.32 @ 15:00:00 02/15/13
Battery Voltage = 13.55 @ 16:00:00 02/15/13
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Reader #2 02:57:40 02/17/13 LA 900 226000012999
Upvotes: 2
Views: 275
Reputation: 8846
A slight variation: this returns every line that begins with Reader and contains 7 elements separated by spaces. I noticed that the first two Reader lines were shorter; if that isn't always the case, this will of course not work.
antenna0 <- readLines("antenna.txt")
# keep only the lines that start with "Reader"
antenna0 <- antenna0[grep("^Reader", antenna0)]
# split each line on single spaces
antenna <- strsplit(antenna0, " ")
# keep the 7-element lines and bind them into a data frame
data.frame(do.call(rbind, antenna[sapply(antenna, length) == 7]))
# X1 X2 X3 X4 X5 X6 X7
#1 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#2 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#3 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#4 Reader #2 02:57:40 02/17/13 LA 900 226000012999
I suppose you could do this directly in grep using more advanced regular expressions, but I don't have the chops.
edit: a slight improvement relying more on grep:
antenna0 <- readLines("antenna.txt")
antenna1 <- antenna0[grep("^Reader.*\\s{4,}", antenna0)]
antenna2 <- do.call(rbind, strsplit(antenna1, " "))
data.frame(antenna2)
# X1 X2 X3 X4 X5 X6 X7
#1 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#2 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#3 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#4 Reader #2 02:57:40 02/17/13 LA 900 226000012999
Short explainer: \\s{4,} matches a run of four or more ({4,}) whitespace characters (\\s); ^Reader matches only strings that start with the letter sequence Reader; and .* combines the two patterns, functioning as an AND operator.
Upvotes: 0
Reputation: 43334
read.table
If you use readLines to read in the text line by line, you can use grep to find the highest line number matching "Battery Voltage" and pass that number to skip.
read.table('file.txt',
           skip = max(grep('Battery Voltage', readLines('file.txt'))),
           # set the comment-delimiting character to anything besides "#"
           comment.char = '')
## V1 V2 V3 V4 V5 V6 V7
## 1 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 3 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
Note that further cleaning (combining columns, formatting dates) is necessary.
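For instance, that cleanup might look something like this (a minimal sketch; the V3/V4 column names and their contents are assumed from the output above):

```r
# Hypothetical cleanup after read.table: paste the date (V4) and time (V3)
# columns together and parse them into a single POSIXct timestamp.
x <- data.frame(V1 = "Reader", V2 = "#2",
                V3 = "02:57:40", V4 = "02/17/13",
                V5 = "LA", V6 = 900L, V7 = 2.26e+11,
                stringsAsFactors = FALSE)
x$timestamp <- as.POSIXct(paste(x$V4, x$V3),
                          format = "%m/%d/%y %H:%M:%S", tz = "UTC")
format(x$timestamp, "%Y-%m-%d %H:%M:%S")  # "2013-02-17 02:57:40"
```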
read.fwf
It may make more sense to use read.fwf (fixed-width files) if the column widths are consistent. You'll need na.omit, complete.cases, or some other means of getting rid of blank lines, since read.fwf doesn't accept a blank.lines.skip parameter the way read.table and its variants do:
na.omit(read.fwf('file.txt',
                 widths = c(9, -1, 17, -1, 2, -1, 3, -1, 12),
                 skip = max(grep('Battery Voltage', readLines('file.txt'))),
                 comment.char = ''))
## V1 V2 V3 V4 V5
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 6 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 8 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
However, counting characters to figure out column widths is a pain (and error-prone).
readr::read_fwf
The readr package makes dealing with fixed-width files slightly less annoying and gives useful warnings when things aren't parsed ideally. It also provides parameters to parse dates and datetimes as you read in your data, which is convenient:
library(readr)
df <- read_fwf(file.txt,
fwf_widths(c(9, 18, 3, 4, NA)),
col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'),'c', 'i', 'd'),
skip = max(grep('Battery Voltage', readLines(file.txt))))
df <- df[complete.cases(df), ]
# or df <- na.omit(df)
# or if some NAs are possible, more robust:
# df <- df[colSums(!apply(df, 1, is.na)) > 0, ]
df
## # A tibble: 4 x 5
## X1 X2 X3 X4 X5
## <chr> <time> <chr> <int> <dbl>
## 1 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
Note the nicely parsed datetime and the slightly simpler input method for column widths (you can let it guess with fwf_empty, which works well if you have column names).
If you've got your widths and column types correct, anything incorrect will be entered as NA, so if you use na.omit, you may be able to avoid a skip parameter altogether:
na.omit(read_fwf('file.txt',
                 fwf_widths(c(9, 18, 3, 4, 13)),
                 col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'), 'c', 'i', 'd')))
## # A tibble: 4 x 5
## X1 X2 X3 X4 X5
## <chr> <time> <chr> <int> <dbl>
## 1 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40 LA 900 2.26e+11
This approach is somewhat precarious, though, so should only be used if you can verify that it's working properly.
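One way to verify it, sketched here on an in-memory sample of the file (the regular expression describing a data line is an assumption, not something from readr): count the raw lines that look like data rows and compare that count against nrow() of the parsed, NA-filtered result.

```r
# Count lines that look like data rows: "Reader", then anything, then a
# 12-digit tag id at the end of the line. This count should match the
# number of rows that survive na.omit() on the read_fwf result.
lines <- c("Power ON @ 12:06:03 02/15/13",
           "Battery Voltage = 13.35 @ 13:00:00 02/15/13",
           "Reader #2 02:57:40 02/17/13 LA 900 226000012999",
           "Reader #2 02:57:40 02/17/13 LA 900 226000012999")
n_data <- length(grep("^Reader .* [0-9]{12}$", lines))
n_data  # 2
```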
Upvotes: 2
Reputation: 293
You could read the file as a block of text and use grep to identify the lines you want to get rid of. Here, I stored your block of text in test.txt. Say your header goes all the way through the Battery Voltage part: you could first identify the line numbers that contain Battery and then find the last instance of it. That will be the number of lines to skip.
con = file('test.txt', 'r')
text = readLines(con)
close(con)
lines_to_skip = max(grep('Battery',text))
You should then read your data just fine.
> x = read.table('test.txt', skip=lines_to_skip, sep=' ', comment.char='')
> x
V1 V2 V3 V4 V5 V6 V7
1 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
3 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
Upvotes: 0
Reputation: 326
Are you always looking for the first line after the last line with "Battery Voltage"? If so, try this:
the.file <- readLines("C:\\Users\\myfile.txt")
row.to.begin.skip.at <- tail(grep("Battery Voltage", the.file), 1)
Then set skip=row.to.begin.skip.at in your preferred reading function. The grep() function searches each line for the phrase "Battery Voltage" and returns the row numbers of all lines containing it; tail(x, 1) then returns the last of those row numbers.
Upvotes: 0