Alium Britt

Reputation: 1316

R - text input file format and acceptable header characters or fields

I have a data file that I need to read into R, but I'm running into problems. To resolve them, I've been trying to find a guide to the header information that R can accept in a text input file. Unfortunately, I haven't found anything describing what the input file should look like, only documentation for the commands used to import various file types.

As to my specific situation, I have a text file (with the extension .dat) that begins with several lines starting with @, which give additional information about the columns, followed by a standard CSV layout. I'm guessing that the lines starting with @ can be read in and affect the structure of my data frame after import, although it's possible that R doesn't use this format. I'm doing all of this in RStudio on Ubuntu with R version 3.0.2.

The text file looks like this:

@relation bupa
@attribute Mcv integer [65.0, 103]
@attribute Alkphos integer [23.0, 138]
@attribute Sgpt integer [4.0, 155]
@attribute Sgot integer [5.0, 82]
@attribute Gammagt integer [5.0, 297]
@attribute Drinks real [0.0, 20.0]
@attribute Selector {1,2}
@inputs Mcv, Alkphos, Sgpt, Sgot, Gammagt, Drinks
@outputs Selector
@data
85.0, 92.0, 45.0, 27.0, 31.0, 0.0, 1
85.0, 64.0, 59.0, 32.0, 23.0, 0.0, 2
...

Now, I could simply skip these header rows and start reading from the actual data lines, but I'd like to bring the header information in if I can.

In case this is just an issue with the command I'm using to import, here are the specific commands I've tried and their associated error messages:

> bupa2 <- read.csv("/bupa/bupa.dat", sep=",", header=T)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed
> bupa2 <- read.csv("/bupa/bupa.dat", sep=", ")
Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE,  : 
  invalid 'sep' value: must be one byte
> bupa2 <- read.csv("/bupa/bupa.dat", sep=",")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed
> bupa2 <- read.table("/bupa/bupa.dat", sep=",")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 9 did not have 2 elements
> bupa2 <- read.table("/bupa/bupa.dat")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 5 elements
> bupa2 <- scan("/bupa/bupa.dat")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '@relation'

What kinds of fields can be accepted by R in a text input file before the data? Is this file an R-supported format? Is there a special command associated with this format that I can use to import it?

Thank you.

Upvotes: 0

Views: 699

Answers (1)

IRTFM

Reputation: 263421

Somebody's already done the necessary work for this format. The foreign package has a function, read.arff, that reads '.arff' files.

# Perhaps:
library(foreign)
bupa <- read.arff(file = "yourTextFileName.ext")

This is what I get when testing on the file scraped from that GitHub link, which seems to be the basis of your file:

> str(bupa)
'data.frame':   345 obs. of  7 variables:
 $ MCV     : num  85 85 86 91 87 98 88 88 92 90 ...
 $ alkphos : num  92 64 54 78 70 55 62 67 54 60 ...
 $ sgpt    : num  45 59 33 34 12 13 20 21 22 25 ...
 $ sgot    : num  27 32 16 24 28 17 17 11 20 19 ...
 $ gammagt : num  31 23 54 36 10 17 9 11 7 5 ...
 $ drinks  : num  0 0 0 0 0 0 0.5 0.5 0.5 0.5 ...
 $ selector: Factor w/ 2 levels "1","2": 1 2 2 2 2 2 1 1 1 1 ...
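
One caveat: the header in the question also has @inputs and @outputs lines and bracketed ranges, which I don't believe are part of the ARFF layout, and I have not checked how read.arff treats them. If it refuses that exact file, a rough base-R fallback is sketched below. It assumes every header line starts with @ and that the column name is the second token on each @attribute line; the temporary file only reproduces the sample from the question so the snippet runs on its own.

```r
# Recreate the sample from the question in a temporary file.
# With the real data, set `path` to the actual file location instead.
path <- tempfile(fileext = ".dat")
writeLines(c(
  "@relation bupa",
  "@attribute Mcv integer [65.0, 103]",
  "@attribute Alkphos integer [23.0, 138]",
  "@attribute Sgpt integer [4.0, 155]",
  "@attribute Sgot integer [5.0, 82]",
  "@attribute Gammagt integer [5.0, 297]",
  "@attribute Drinks real [0.0, 20.0]",
  "@attribute Selector {1,2}",
  "@inputs Mcv, Alkphos, Sgpt, Sgot, Gammagt, Drinks",
  "@outputs Selector",
  "@data",
  "85.0, 92.0, 45.0, 27.0, 31.0, 0.0, 1",
  "85.0, 64.0, 59.0, 32.0, 23.0, 0.0, 2"
), path)

# Take the column names from the "@attribute <name> ..." lines
# (assumption: the name is always the second whitespace-separated token).
attr_lines <- grep("^@attribute", readLines(path), value = TRUE)
col_names <- sub("^@attribute[[:space:]]+([^[:space:]]+).*$", "\\1", attr_lines)

# Declaring '@' as the comment character makes read.csv ignore
# every header line and parse only the comma-separated data rows.
bupa <- read.csv(path, header = FALSE, comment.char = "@",
                 strip.white = TRUE, col.names = col_names)

# The {1,2} declaration suggests Selector is categorical, so store it as a factor.
bupa$Selector <- factor(bupa$Selector)

str(bupa)
```

This keeps the column names but drops the declared types and ranges; if those matter, they would have to be parsed out of the same @attribute lines separately.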

Upvotes: 2
