Reputation: 1975
Thanks in advance for the help. I was testing obtaining data from websites when I ran across this one: http://lib.stat.cmu.edu/datasets/sleep. I proceeded as follows:
(A) Get a sense of the data (in R). I typed the following:
readLines("http://lib.stat.cmu.edu/datasets/sleep", n=100)
(B) I noticed that the data I want starts on the 51st line, so I wrote this code:
sleep_table <- read.table("http://lib.stat.cmu.edu/datasets/sleep", header=FALSE, skip=50)
(C) I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 14 elements
I got the above approach from another question on Stack Overflow (import dat file into R). However, that question deals with a .dat file, while mine involves data at a particular URL. What I'd like to know is how to get the data from line 51 down (as numbered by readLines) into a data frame with no headers. I'll add those in later with colnames(sleep_table) <- c("etc.", "etc2", "etc3"...).
Upvotes: 1
Views: 161
Reputation: 269862
Use the fact that the good lines end in a one-digit field and that every field except the first is numeric:
URL <- "http://lib.stat.cmu.edu/datasets/sleep"
L <- readLines(URL)
# lines ending in a one digit field
good.lines <- grep(" \\d$", L, value = TRUE)
# insert commas before numeric fields
lines.csv <- gsub("( [-0-9.])", ",\\1", good.lines)
# re-read
DF <- read.table(text = lines.csv, sep = ",", as.is = TRUE, strip.white = TRUE,
na.strings = "-999.0")
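To see what the comma-insertion step does without hitting the network, here is a small sketch on two sample rows. The values are taken from the data shown on this page, but the exact spacing is an assumption, and the third line is a made-up non-data line that the filter should drop.

```r
# Two data rows (spacing assumed) plus one made-up non-data line
sample.lines <- c(
  "African elephant         6654.000 5712.000  -999.0  -999.0     3.3    38.6   645.0       3       5       3",
  "African giant pouched rat   1.000    6.600     6.3     2.0     8.3     4.5    42.0       3       1       3",
  "This is a description line, not data."
)

# keep only the lines that end in a one-digit field
good <- grep(" \\d$", sample.lines, value = TRUE)

# put a comma in front of each numeric field (space followed by digit, minus or dot)
csv <- gsub("( [-0-9.])", ",\\1", good)
print(csv)

# re-read as comma-separated; -999.0 is treated as missing
demo <- read.table(text = csv, sep = ",", as.is = TRUE, strip.white = TRUE,
                   na.strings = "-999.0")
str(demo)
```

Each numeric field now has a comma in front of it, so read.table can split on commas even though the species names themselves contain spaces.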
If you are interested in the headings too, here is some code for that. Omit the rest if you do not need the headings.
# get headings: among the lines that start at the left edge, these are
# the ncol(DF) lines beginning with the one that contains "species"
headings0 <- grep("^[^ ]", L, value = TRUE)
i <- grep("species", headings0)
headings <- headings0[seq(i, length.out = ncol(DF))]
# The headings are a bit long so we shorten them to the first word
names(DF) <- sub(" .*$", "", headings)
This gives:
> head(DF)
                    species     body  brain slow paradoxical total maximum
1          African elephant 6654.000 5712.0   NA          NA   3.3    38.6
2 African giant pouched rat    1.000    6.6  6.3         2.0   8.3     4.5
3                Arctic Fox    3.385   44.5   NA          NA  12.5    14.0
4    Arctic ground squirrel    0.920    5.7   NA          NA  16.5      NA
5            Asian elephant 2547.000 4603.0  2.1         1.8   3.9    69.0
6                    Baboon   10.550  179.5  9.1         0.7   9.8    27.0
  gestation predation sleep overall
1       645         3     5       3
2        42         3     1       3
3        60         1     1       1
4        25         5     2       3
5       624         3     5       4
6       180         4     4       4
UPDATE: minor simplification in white space trimming
UPDATE 2: shorten headings
UPDATE 3: added na.strings = "-999.0"
Upvotes: 3
Reputation: 10215
Since "Lesser short-tailed shrew" and "Pig" are followed by different numbers of separator spaces, and the other fields are not tab-separated, read.table will not help. Luckily, the file appears to be fixed-width. Note that this solution is not complete: there are a few nasty lines at the end of the data, and you will probably have to convert the character columns to numbers, but that's left as an easy exercise.
# 123456789012345689012345678901234568901234567890123456890123456789012345689012345678901234568901234567890123456890
# African elephant 6654.000 5712.000 -999.0 -999.0 3.3 38.6 645.0 3 5 3
# African giant pouched rat 1.000 6.600 6.3 2.0 8.3 4.5 42.0 3 1 3
sleep_table <- read.fwf("http://lib.stat.cmu.edu/datasets/sleep", widths = c(25,rep(8,10)),
header=FALSE, skip=51)
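As a rough sketch of the conversion step left as an exercise, the following builds two fixed-width rows offline and cleans them up. The values come from the output shown earlier on this page; the exact layout (a 25-character name followed by ten 8-character fields) and the helper name to.num are assumptions for illustration, not something taken from the file itself.

```r
# Build two fixed-width rows: 25-character name + ten 8-character fields (assumed layout)
name <- c("African giant pouched rat", "Arctic Fox")
nums <- list(
  c("1.000", "6.600", "6.3", "2.0", "8.3", "4.5", "42.0", "3", "1", "3"),
  c("3.385", "44.500", "-999.0", "-999.0", "12.5", "14.0", "60.0", "1", "1", "1")
)
fwf.lines <- paste0(
  sprintf("%-25s", name),
  vapply(nums, function(v) paste(sprintf("%8s", v), collapse = ""), character(1))
)

# write to a temporary file and read it back with the same widths as above
tf <- tempfile()
writeLines(fwf.lines, tf)
fwf <- read.fwf(tf, widths = c(25, rep(8, 10)), header = FALSE,
                stringsAsFactors = FALSE)
unlink(tf)

# hypothetical helper: coerce a column to numeric and map the -999 code to NA
to.num <- function(x) {
  x <- as.numeric(trimws(x))
  x[x %in% -999] <- NA
  x
}

fwf[[1]] <- trimws(fwf[[1]])        # drop the padding around the species name
fwf[-1] <- lapply(fwf[-1], to.num)  # every other column becomes numeric
str(fwf)
```

On the real file the trailing non-data lines would still need to be dropped first (for example with the n argument of read.fwf), otherwise as.numeric will warn and produce NA for them.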
Upvotes: 3