How to create a dataframe from a site which appears to store each row as a list?

Question

Thanks in advance for the help. While testing how to pull data from websites, I came across this one: http://lib.stat.cmu.edu/datasets/sleep. I proceeded as follows:

(A) Get a sense of the data (in R). I typed the following:

readLines("http://lib.stat.cmu.edu/datasets/sleep", n=100)

(B) I noticed that the data I want starts on line 51, so I wrote this code:

sleep_table <- read.table("http://lib.stat.cmu.edu/datasets/sleep", header=FALSE, skip=50)

(C) I get the following error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
line 1 did not have 14 elements

I took the above approach from another question on Stack Overflow (import dat file into R). However, that question deals with a .dat file, whereas mine concerns data at a particular URL. How do I get the data from line 51 onward (as numbered by readLines) into a dataframe with no headers? I'll add those later with colnames(sleep_table) <- c("etc.", "etc2", "etc3", ...).

G. Grothendieck · Accepted Answer

Use the fact that the good lines end in a one digit field and that every field except the first is numeric:

URL <- "http://lib.stat.cmu.edu/datasets/sleep"
L <- readLines(URL)

# lines ending in a one digit field
good.lines <- grep(" \\d$", L, value = TRUE)

# insert commas before numeric fields
lines.csv <- gsub("( [-0-9.])", ",\\1", good.lines)

# re-read
DF <- read.table(text = lines.csv, sep = ",", as.is = TRUE, strip.white = TRUE, 
         na.strings = "-999.0")
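
To see what each step does without touching the network, here is a minimal sketch that runs the same three steps on a few made-up lines. The sample lines are an assumption that only imitates the layout of the file (a species name followed by numeric fields, with -999.0 marking missing values); they are not copied from the site.

```r
# Hypothetical sample lines imitating the file's layout (not downloaded data)
L <- c(
  "Some descriptive text that should be skipped",
  "African elephant         6654.000 5712.000  -999.0  -999.0     3.3    38.6   645.0       3       5       3",
  "Arctic Fox                  3.385   44.500  -999.0  -999.0    12.5    14.0    60.0       1       1       1",
  ""
)

# keep only lines ending in a one digit field
good.lines <- grep(" \\d$", L, value = TRUE)

# put a comma before each numeric field, so the spaces inside
# the species name no longer act as field separators
lines.csv <- gsub("( [-0-9.])", ",\\1", good.lines)

# re-read as comma-separated text; -999.0 becomes NA
DF <- read.table(text = lines.csv, sep = ",", as.is = TRUE,
                 strip.white = TRUE, na.strings = "-999.0")

str(DF)
```

The multi-word species name ends up in a single column because commas are inserted only before fields that start with a digit, a minus sign, or a dot.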

If you are interested in the headings too here is some code for that. Omit the rest if you are not interested in headings.

# get headings - of the lines starting at left edge these are the ncol(DF) lines
#  starting with the one containing "species"
headings0 <- grep("^[^ ]", L, value = TRUE)
i <- grep("species", headings0)
headings <- headings0[seq(i, length = ncol(DF))]

# The headings are a bit long so we shorten them to the first word
names(DF) <- sub(" .*$", "", headings)
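
The heading logic can also be tried offline. The sketch below uses invented lines and a hypothetical n.col that stands in for ncol(DF); it assumes, as in the code above, that exactly one left-aligned line contains "species" and that the column descriptions follow it one per line.

```r
# Hypothetical lines imitating the file's layout (not the real file contents)
L <- c(
  "Data from a study of sleep",
  "   indented continuation text",
  "species of animal",
  "body weight in kg",
  "brain weight in g",
  "Baboon   10.550  179.5",
  "Cat       3.300   25.6"
)
n.col <- 3  # stands in for ncol(DF)

# lines that start at the left edge
headings0 <- grep("^[^ ]", L, value = TRUE)

# take n.col lines starting with the one containing "species"
i <- grep("species", headings0)
headings <- headings0[seq(i, length.out = n.col)]

# shorten each heading to its first word
short <- sub(" .*$", "", headings)
print(short)
```

Here length.out is spelled out in full; the shorter length = used above relies on R's partial argument matching and selects the same lines.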

This gives:

> head(DF)
                    species     body  brain slow paradoxical total maximum
1          African elephant 6654.000 5712.0   NA          NA   3.3    38.6
2 African giant pouched rat    1.000    6.6  6.3         2.0   8.3     4.5
3                Arctic Fox    3.385   44.5   NA          NA  12.5    14.0
4    Arctic ground squirrel    0.920    5.7   NA          NA  16.5      NA
5            Asian elephant 2547.000 4603.0  2.1         1.8   3.9    69.0
6                    Baboon   10.550  179.5  9.1         0.7   9.8    27.0
  gestation predation sleep overall
1       645         3     5       3
2        42         3     1       3
3        60         1     1       1
4        25         5     2       3
5       624         3     5       4
6       180         4     4       4

UPDATE: minor simplification in white space trimming

UPDATE 2: shorten headings

UPDATE 3: added na.strings = "-999.0"
