MorrisseyJ
MorrisseyJ

Reputation: 1271

Read ASCII text tabular format into R - list format

I am trying to read an ASCI text file into R. The format however is a little difficult, with categorical variables appearing as a list name would, rather than as a variable in one large dataset.

Here is an example of the data: https://cdiac.ess-dive.lbl.gov/ftp/ndp030/nation.1751_2014.ems

None of these work:

url <- 'https://cdiac.ess-dive.lbl.gov/ftp/ndp030/nation.1751_2014.ems'

read.table(url)
scan(url)

I can skip to different lines, but that doesn't solve things. Regardless of where i start i end up with the following errors:

I assume there should an easy way to import this. Obviously if this format is inherently hard to handle and i would have to read every line and then write a function break everything apart, then not to worry.

Any ideas on a simple way to handle this format?

p.s. I realize there is another source of this data stored as a CSV in a tidy format (https://cdiac.ess-dive.lbl.gov/trends/emis/tre_coun.html). However I am now wondering how I would solve this problem.

Upvotes: 0

Views: 218

Answers (1)

r2evans
r2evans

Reputation: 160667

Try this:

txt <- readLines('https://cdiac.ess-dive.lbl.gov/ftp/ndp030/nation.1751_2014.ems')

### demonstration of how to find the breaks:
nms <- grep("^[A-Za-z]+$", txt, value = TRUE)
head(nms)
# [1] "AFGHANISTAN" "ALBANIA"     "ALGERIA"     "ANDORRA"     "ANGOLA"     
# [6] "ANGUILLA"   
tail(nms)
# [1] "VANUATU"   "VENEZUELA" "YEMEN"     "ZAMBIA"    "ZANZIBAR"  "ZIMBABWE" 

The major work:

lists <- by(txt, cumsum(grepl("^[A-Za-z]+$", txt)), function(s) {
  ind <- grepl("^[0-9]", s)
  if (any(ind)) cbind(cntry = s[1], read.table(text = as.character(s[ind]), stringsAsFactors = FALSE))
})
lists <- Filter(length, lists)

head(lists[[1]])
#         cntry   V1 V2 V3 V4 V5 V6 V7   V8 V9
# 1 AFGHANISTAN 1949  4  0  0  4  .  0    .  0
# 2 AFGHANISTAN 1950 23  0 18  6  0  0 0.00  0
# 3 AFGHANISTAN 1951 25  0 18  7  0  0 0.00  0
# 4 AFGHANISTAN 1952 25  0 17  9  0  0 0.00  0
# 5 AFGHANISTAN 1953 29  0 18 10  0  0 0.00  0
# 6 AFGHANISTAN 1954 29  0 18 12  0  0 0.00  0

Combine them all:

alldat <- do.call(rbind, c(lists, list(stringsAsFactors = FALSE)))
head(alldat)
#           cntry   V1 V2 V3 V4 V5 V6 V7   V8 V9
# 1.1 AFGHANISTAN 1949  4  0  0  4  .  0    .  0
# 1.2 AFGHANISTAN 1950 23  0 18  6  0  0 0.00  0
# 1.3 AFGHANISTAN 1951 25  0 18  7  0  0 0.00  0
# 1.4 AFGHANISTAN 1952 25  0 17  9  0  0 0.00  0
# 1.5 AFGHANISTAN 1953 29  0 18 10  0  0 0.00  0
# 1.6 AFGHANISTAN 1954 29  0 18 12  0  0 0.00  0
tail(alldat)
#           cntry   V1   V2 V3   V4   V5 V6  V7   V8 V9
# 160.93 ZIMBABWE 2009 1528  0  455  977  0  95 0.11  6
# 160.94 ZIMBABWE 2010 2121  0  481 1531  0 109 0.15  7
# 160.95 ZIMBABWE 2011 2608  0  888 1584  0 136 0.18  8
# 160.96 ZIMBABWE 2012 2125  0 1006  917  0 201 0.15  9
# 160.97 ZIMBABWE 2013 3184  0 1119 1902  0 162 0.21  9
# 160.98 ZIMBABWE 2014 3278  0 1005 2097  0 177 0.22  9

(This can also be done with dplyr::bind_rows or data.table::rbindlist.)

The actual column names are a bit verbose and non-standard for R, I'll leave it up to you to come up with meaningful colnames.

Upvotes: 2

Related Questions