Reputation: 1367
I have a big text file that is the result of another program appending multiple tables together (including headers). I'd like to read that file into R and split it apart.
This is sort of similar to this problem, except I don't know the row number, and I need to split it on the header row, not a particular row number. I do know that each table starts with a header, though, and that the Wvlgth values always start with 337.0 and end with 823.0.
Here is the text file; it looks just like this, except each table has 550 rows:
Wvlgth Global
337.0 .4345
337.5 .1256
338.0 .8754
<...>
821.0 .9923
822.0 .7124
823.0 .2999
Wvlgth Global
337.0 .5632
337.5 .1245
338.0 .0012
<...>
821.0 .1987
822.0 .6743
823.0 .2045
Here is the code to generate something similar in R:
df = data.frame("Wvlgth" = c('337.0', '337.5', '338.0', '821.0', '822.0', '823.0'),
                "Global" = c(.4345, .1256, .8754, .9923, .7124, .2999))
I want this to turn into multiple dataframes, like so:
Wvlgth Global
1 337.0 .4345
2 337.5 .1256
3 338.0 .8754
<...>
548 821.0 .9923
549 822.0 .7124
550 823.0 .2999
Wvlgth Global
1 337.0 .5632
2 337.5 .1245
3 338.0 .0012
<...>
548 821.0 .1987
549 822.0 .6743
550 823.0 .2045
I'm not sure if there is a way to do this through read.csv or if I need to read the whole thing in and split it after the fact.
Upvotes: 1
Views: 158
Reputation: 1795
I created some Monte Carlo data myself.
# load libraries
library(stringi)
library(data.table)
library(plyr)
input <- readLines("path/to/your/csv", warn = FALSE)  # read the input CSV file
input <- trimws(input)                                # strip leading/trailing whitespace
which looks like:
>input
[1] "Wvlgth Global" "337.0 0.4345" "337.5 0.1256" "338.0 0.8754" "821.0 0.9923" "822.0 0.7124" "823.0 0.2999" "Wvlgth Global" "327.0 0.5345" "317.5 0.5256"
[11] "358.0 0.4754" "871.0 0.93235" "882.0 0.2124" "893.0 0.1999" "811.0 0.93235" "972.0 0.33235" "Wvlgth Global" "893.0 0.2399" "193.0 0.5120" "892.0 0.3199"
To transform it into a more useful format (a data.table):
dt <- data.table(ldply(stri_split(str = input, fixed = " "), "["))  # one row per line, split on spaces
dt[, Wvlgth := as.numeric(V1)][, Global := as.numeric(V2)][, V1 := NULL][, V2 := NULL]  # coerce to numeric (header rows become NA, with a warning)
dt[, containsNA := is.na(Wvlgth)]  # boolean flag marking the header rows, now NA
The resulting data table looks like:
>dt
Wvlgth Global containsNA
1: NA NA TRUE
2: 337.0 0.43450 FALSE
3: 337.5 0.12560 FALSE
4: 338.0 0.87540 FALSE
5: 821.0 0.99230 FALSE
6: 822.0 0.71240 FALSE
7: 823.0 0.29990 FALSE
8: NA NA TRUE
9: 327.0 0.53450 FALSE
10: 317.5 0.52560 FALSE
11: 358.0 0.47540 FALSE
12: 871.0 0.93235 FALSE
13: 882.0 0.21240 FALSE
14: 893.0 0.19990 FALSE
15: 811.0 0.93235 FALSE
16: 972.0 0.33235 FALSE
17: NA NA TRUE
18: 893.0 0.23990 FALSE
19: 193.0 0.51200 FALSE
20: 892.0 0.31990 FALSE
We then split on the running count of header rows:
l1 <- split(dt, cumsum(dt$containsNA))
yielding:
>l1
$`1`
Wvlgth Global containsNA
1 NA NA TRUE
2 337.0 0.4345 FALSE
3 337.5 0.1256 FALSE
4 338.0 0.8754 FALSE
5 821.0 0.9923 FALSE
6 822.0 0.7124 FALSE
7 823.0 0.2999 FALSE
$`2`
Wvlgth Global containsNA
8 NA NA TRUE
9 327.0 0.53450 FALSE
10 317.5 0.52560 FALSE
11 358.0 0.47540 FALSE
12 871.0 0.93235 FALSE
13 882.0 0.21240 FALSE
14 893.0 0.19990 FALSE
15 811.0 0.93235 FALSE
16 972.0 0.33235 FALSE
$`3`
Wvlgth Global containsNA
17 NA NA TRUE
18 893 0.2399 FALSE
19 193 0.5120 FALSE
20 892 0.3199 FALSE
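To see why cumsum(dt$containsNA) worked as a group id: each TRUE (a header row) increments the running sum by one, so every data row belonging to the same table shares one group number. A tiny illustration:

```r
# Each TRUE (a header row) bumps the running sum, so all data rows
# between two headers end up with the same group number.
flags <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
cumsum(flags)  # 1 1 1 2 2
```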
Finally, to obtain the format we desire (dropping the NA rows and the containsNA
column), we apply the following operation to every element of the list:
lapply(l1, function(x) x[, .SD[!is.na(Wvlgth)]][, containsNA := NULL])
resulting in:
$`1`
Wvlgth Global
1: 337.0 0.4345
2: 337.5 0.1256
3: 338.0 0.8754
4: 821.0 0.9923
5: 822.0 0.7124
6: 823.0 0.2999
$`2`
Wvlgth Global
1: 327.0 0.53450
2: 317.5 0.52560
3: 358.0 0.47540
4: 871.0 0.93235
5: 882.0 0.21240
6: 893.0 0.19990
7: 811.0 0.93235
8: 972.0 0.33235
$`3`
Wvlgth Global
1: 893 0.2399
2: 193 0.5120
3: 892 0.3199
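For what it's worth, the same split can be sketched more compactly in base R, without the intermediate NA rows. This is just a sketch on a small inline sample; in practice input would be the trimmed character vector from readLines above:

```r
# Small inline sample standing in for the readLines() output.
input <- c("Wvlgth Global", "337.0 0.4345", "337.5 0.1256",
           "Wvlgth Global", "327.0 0.5345", "317.5 0.5256")
is_header <- input == "Wvlgth Global"  # TRUE on each header line
# Group the data lines by how many headers precede them, then parse each chunk.
chunks <- split(input[!is_header], cumsum(is_header)[!is_header])
tables <- lapply(chunks, function(ch)
  read.table(text = ch, col.names = c("Wvlgth", "Global")))
```

Each element of tables is a data.frame with numeric Wvlgth and Global columns and no leftover header rows.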
In case the link to the Monte Carlo data goes dead, here is the data
that was used for this particular answer.
Wvlgth Global
337.0 0.4345
337.5 0.1256
338.0 0.8754
821.0 0.9923
822.0 0.7124
823.0 0.2999
Wvlgth Global
327.0 0.5345
317.5 0.5256
358.0 0.4754
871.0 0.93235
882.0 0.2124
893.0 0.1999
811.0 0.93235
972.0 0.33235
Wvlgth Global
893.0 0.2399
193.0 0.5120
892.0 0.3199
Upvotes: 0
Reputation: 16277
Here's how to do that using base R's split. Basically, you need to create a group column using cumsum over the rows where column 1 equals "Wvlgth". Then you can split the result into a list. You can then access the elements of that list like so: df_list[[1]]
df <- read.table(text="Wvlgth Global
337.0 .4345
337.5 .1256
338.0 .8754
821.0 .9923
822.0 .7124
823.0 .2999
Wvlgth Global
337.0 .5632
337.5 .1245
338.0 .0012
821.0 .1987
822.0 .6743
823.0 .2045", header = FALSE, stringsAsFactors = FALSE)
df$group <- cumsum(df[, 1] == "Wvlgth")
df_list <- split(df, list(df$group))
$`1`
V1 V2 group
1 Wvlgth Global 1
2 337.0 .4345 1
3 337.5 .1256 1
4 338.0 .8754 1
5 821.0 .9923 1
6 822.0 .7124 1
7 823.0 .2999 1
$`2`
V1 V2 group
8 Wvlgth Global 2
9 337.0 .5632 2
10 337.5 .1245 2
11 338.0 .0012 2
12 821.0 .1987 2
13 822.0 .6743 2
14 823.0 .2045 2
To access a single data.frame:
df_list[[1]]
V1 V2 group
1 Wvlgth Global 1
2 337.0 .4345 1
3 337.5 .1256 1
4 338.0 .8754 1
5 821.0 .9923 1
6 822.0 .7124 1
7 823.0 .2999 1
Additionally, if you want to set up the column names of the data.frames, you can use lapply:
new_col_name <- c("Wvlgth", "Global", "group")
df_list <- lapply(df_list, setNames, nm = new_col_name)  # set names
df_list <- lapply(df_list, function(x) x[-1, ])          # remove the header row
> df_list
$`1`
Wvlgth Global group
2 337.0 .4345 1
3 337.5 .1256 1
4 338.0 .8754 1
5 821.0 .9923 1
6 822.0 .7124 1
7 823.0 .2999 1
$`2`
Wvlgth Global group
9 337.0 .5632 2
10 337.5 .1245 2
11 338.0 .0012 2
12 821.0 .1987 2
13 822.0 .6743 2
14 823.0 .2045 2
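One caveat: because the header rows were read in as data, the Wvlgth and Global columns are still character at this point. A final cleanup sketch (shown on a minimal hypothetical df_list element, not the full data) converts them to numeric and drops the helper group column:

```r
# Minimal stand-in for one element of df_list; the columns are character
# because read.table saw the header rows as data.
df_list <- list(`1` = data.frame(Wvlgth = c("337.0", "337.5"),
                                 Global = c(".4345", ".1256"),
                                 group = c(1L, 1L),
                                 stringsAsFactors = FALSE))
df_list <- lapply(df_list, function(x) {
  x$Wvlgth <- as.numeric(x$Wvlgth)  # coerce character columns to numeric
  x$Global <- as.numeric(x$Global)
  x$group <- NULL                   # drop the helper grouping column
  x
})
```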
Upvotes: 1