Mark
Mark

Reputation: 45

How to extract different type of data from a csv file in R

I have a csv file (dropbox link), which contains two types of information separated by a blank/empty line:

The header of the second table consists of 3 separate lines/rows. Ideally I would like to keep only one, which starts with ID.

I would like to import the above into R in a list format:

> print(data)

$experiment name
[1] "test1"

$date
[1] "01/01/2015"

$protocol
[1] "test2"

$name
[1] "John Doe"

$data
   ID name value
1  A1   AA   0.1
2  A2   BB  0.16
3  A3   CC  0.12
4  A4   DD  0.45
5  A5   EE  0.35
6  B1   FF   1.2
7  B2   GG  1.12
8  B3   HH  1.05
9  B4   II  1.21
10 B5   JJ  1.18

I have tried to scan the csv file and conditionally import the lines, but to no avail. Is there a base function or a set of functions that would allow me to import the two tables separated with an empty line, and conditionally read-in the second table from the ID character onwards?

I would greatly appreciate your help!

Upvotes: 1

Views: 150

Answers (2)

Miff
Miff

Reputation: 7941

A starting point would be:

data <- list()
all <- read.csv("data.csv", header=F, stringsAsFactors=F)
blank.line <- which(all[[1]]=="")
for (i in 1:(blank.line-1)) data[[all$V1[i]]] <- all$V2[i]
data$data <- read.csv("data.csv", skip=grep("^ID",all[,1])-1, header=T, stringsAsFactors=F)

the names of list elements in data aren't exactly as you've written - these could be fixed with gsub()ing the colons out if it is important.

Code edited in the light of comments about top table having a variable length

Upvotes: 1

agstudy
agstudy

Reputation: 121568

You can do something like this (not exactly what do you want but it is a good start) :

## use readlines since you unstrctred data
ll <- readLines("data.csv")     
list(
     ## I assume that the data is always in 3 first lines
     ## remove extra comma then use read.table   
     settings = read.table(text=gsub(',','',ll[seq(3)]),sep=':'),
     ## find where the data begin using grep
     data= read.table(text=ll[-seq(grep("ID",ll)-1)],header=TRUE,sep=','))

# $settings
#                V1         V2
# 1 experiment name      test1
# 2            date 01/01/2015
# 3        protocol      test2
# 
# $data
#    ID name value
# 1  A1   AA  0.10
# 2  A2   BB  0.16
# 3  A3   CC  0.12
# 4  A4   DD  0.45
# 5  A5   EE  0.35
# 6  B1   FF  1.20
# 7  B2   GG  1.12
# 8  B3   HH  1.05
# 9  B4   II  1.21
# 10 B5   JJ  1.18

Upvotes: 2

Related Questions