Reputation: 2773
I need to read a csv file in R. But the file contains some text information in some rows instead of comma values. So i cannot read that file using read.csv(fileName) method. The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only values of each name,date pair as data frame. To do that how can i read that file?
Actually my required output is
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
Upvotes: 1
Views: 387
Reputation: 60924
I would read the entire file first as a list of characters, i.e. a string for each line in the file, this can be done using readLines
. Next you have to find the places where the data for a new date starts, i.e. look for ,,
, see grep
for that. Then take the first entry of each data block, e.g. using str_extract
from the stringr
package. Finally, you need split all the remaing data strings, see strsplit
for that.
Upvotes: 4
Reputation: 81683
You can read the data with scan
and use grep
and sub
functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings staring with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
lapply(x, function(y)
if (length(y) == 3) y))
create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz
Upvotes: 8