ale19
ale19

Reputation: 1367

R: split concatenated dataframe by row

I have a big text file that is the result of another program appending multiple tables together (including headers). I'd like to read that file into R and split it apart.

This is sort of similar to this problem, except I don't know the row number, and I need to split it on the header row, not a particular row number. I do know that each table starts with a header, though, and that the Wvlgth values always start with 337.0 and end with 823.0.

Here is the text file. It looks identical to this, except with 550 rows.

Wvlgth Global
337.0  .4345
337.5  .1256
338.0  .8754
<...>
821.0  .9923
822.0  .7124
823.0  .2999
Wvlgth Global
337.0  .5632
337.5  .1245
338.0  .0012
<...>
821.0  .1987
822.0  .6743
823.0  .2045

Here is the code to generate something similar in R:

    df = data.frame("Wvlgth" = c('337.0','337.5','338.0','821.0','822.0','823.0'), 
                   "Global" = c(.4345, .1256, .8754, .9923, .7124, .2999))

I want this to turn into multiple dataframes, like so:

Dataframe 1

     Wvlgth Global
1    337.0 .4345
2    337.5 .1256
3    338.0 .8754
<...>
548  821.0 .9923
549  822.0 .7124
550  823.0 .2999

Dataframe 2

     Wvlgth Global
1    337.0 .5632
2    337.5 .1245
3    338.0 .0012
<...>
548  821.0 .1987
549  822.0 .6743
550  823.0 .2045

Not sure if there is a way to do this through read.csv or if i need to read the whole thing in and split it after the fact.

Upvotes: 1

Views: 158

Answers (2)

amonk
amonk

Reputation: 1795

I created some monte carlo data myself

#loading libraries
library(stringi)
library(data.table)
library(plyr)

input <- readLines("path/to/your/csv",warn = F)#reading input csv file
input <- trimws(input)#removing spaces left and right

Looking like:

 >input
 [1] "Wvlgth Global" "337.0 0.4345"  "337.5 0.1256"  "338.0 0.8754"  "821.0 0.9923"  "822.0 0.7124"  "823.0 0.2999"  "Wvlgth Global" "327.0 0.5345"  "317.5 0.5256" 
[11] "358.0 0.4754"  "871.0 0.93235" "882.0 0.2124"  "893.0 0.1999"  "811.0 0.93235" "972.0 0.33235" "Wvlgth Global" "893.0 0.2399"  "193.0 0.5120"  "892.0 0.3199" 

In order to transform it to a more useful format (data.table):

dt<-data.table(ldply(stri_split(str = input,fixed=" "),"["))#creating data.table
dt[,Wvlgth:=as.numeric(V1)][,Global:=as.numeric(V2)][,V1:=NULL][,V2:=NULL]#performing some column manipulation (by importing as.numeric, the characters are transformed into NA's
dt[,containsNA:=is.na(Wvlgth)]#adding boolean tag if Wvlgth is NA

The resulting data table looks like:

>dt
     Wvlgth  Global containsNA
1:     NA      NA       TRUE
2:  337.0 0.43450      FALSE
3:  337.5 0.12560      FALSE
4:  338.0 0.87540      FALSE
5:  821.0 0.99230      FALSE
6:  822.0 0.71240      FALSE
7:  823.0 0.29990      FALSE
8:     NA      NA       TRUE
9:  327.0 0.53450      FALSE
10:  317.5 0.52560      FALSE
11:  358.0 0.47540      FALSE
12:  871.0 0.93235      FALSE
13:  882.0 0.21240      FALSE
14:  893.0 0.19990      FALSE
15:  811.0 0.93235      FALSE
16:  972.0 0.33235      FALSE
17:     NA      NA       TRUE
18:  893.0 0.23990      FALSE
19:  193.0 0.51200      FALSE
20:  892.0 0.31990      FALSE

We then apply

 l1<-split(dt,cumsum(dt$containsNA))

yielding in:

>l1
$`1`
  Wvlgth Global containsNA
1     NA     NA       TRUE
2  337.0 0.4345      FALSE
3  337.5 0.1256      FALSE
4  338.0 0.8754      FALSE
5  821.0 0.9923      FALSE
6  822.0 0.7124      FALSE
7  823.0 0.2999      FALSE

$`2`
   Wvlgth  Global containsNA
8      NA      NA       TRUE
9   327.0 0.53450      FALSE
10  317.5 0.52560      FALSE
11  358.0 0.47540      FALSE
12  871.0 0.93235      FALSE
13  882.0 0.21240      FALSE 
14  893.0 0.19990      FALSE
15  811.0 0.93235      FALSE
16  972.0 0.33235      FALSE

$`3`
  Wvlgth Global containsNA
17     NA     NA       TRUE
18    893 0.2399      FALSE
19    193 0.5120      FALSE
20    892 0.3199      FALSE

Finally in order to acquire the format we desire (removing NA rows and containsNA column that is) we perform on every element of the list the following operation:

lapply(l1,function(x) x[,.SD[(!is.na(Wvlgth))]][,containsNA:=NULL])

resulting in:

$`1`
   Wvlgth Global
1:  337.0 0.4345 
2:  337.5 0.1256
3:  338.0 0.8754
4:  821.0 0.9923
5:  822.0 0.7124
6:  823.0 0.2999

$`2`
    Wvlgth  Global
1:  327.0 0.53450
2:  317.5 0.52560
3:  358.0 0.47540
4:  871.0 0.93235
5:  882.0 0.21240
6:  893.0 0.19990
7:  811.0 0.93235
8:  972.0 0.33235

$`3`
   Wvlgth Global
1:    893 0.2399
2:    193 0.5120
3:    892 0.3199

ADDENDUM:

In case the link to the MC data gets deprecated, here is the data that was used for the particular question.

 Wvlgth Global
 337.0 0.4345
 337.5 0.1256
 338.0 0.8754
 821.0 0.9923
 822.0 0.7124
 823.0 0.2999
 Wvlgth Global
 327.0 0.5345
 317.5 0.5256
 358.0 0.4754
 871.0 0.93235
 882.0 0.2124
 893.0 0.1999
 811.0 0.93235
 972.0 0.33235
 Wvlgth Global
 893.0 0.2399
 193.0 0.5120
 892.0 0.3199

Upvotes: 0

Pierre Lapointe
Pierre Lapointe

Reputation: 16277

Here's how to do that using split from data.table. Basically, you need to create a group column using cumsum on "Wvlgth" in column 1. Then you can split the result into a list. You can then access the elements of that list like so: df_list[[1]]

df <- read.table(text="Wvlgth Global
337.0  .4345
                337.5  .1256
                338.0  .8754
                821.0  .9923
                822.0  .7124
                823.0  .2999
                Wvlgth Global
                337.0  .5632
                337.5  .1245
                338.0  .0012
                821.0  .1987
                822.0  .6743
                823.0  .2045",header=FALSE,stringsAsFactors=FALSE)
df$group <- cumsum(df[,1]=="Wvlgth")
df_list <- split(df, list(df$group))

$`1`
      V1     V2 group
1 Wvlgth Global     1
2  337.0  .4345     1
3  337.5  .1256     1
4  338.0  .8754     1
5  821.0  .9923     1
6  822.0  .7124     1
7  823.0  .2999     1

$`2`
       V1     V2 group
8  Wvlgth Global     2
9   337.0  .5632     2
10  337.5  .1245     2
11  338.0  .0012     2
12  821.0  .1987     2
13  822.0  .6743     2
14  823.0  .2045     2

To access a single data.frame:

df_list[[1]]
      V1     V2 group
1 Wvlgth Global     1
2  337.0  .4345     1
3  337.5  .1256     1
4  338.0  .8754     1
5  821.0  .9923     1
6  822.0  .7124     1
7  823.0  .2999     1

Additionally, if you want to set up the column names of the data.frames, you can use lapply:

new_col_name <- c("Wvlgth", "Global","group")
df_list <- lapply(df_list, setNames, nm = new_col_name) #set names
df_list <- lapply(df_list, function(x) x[-1,]) #remove first row

> df_list
$`1`
  Wvlgth Global group
2  337.0  .4345     1
3  337.5  .1256     1
4  338.0  .8754     1
5  821.0  .9923     1
6  822.0  .7124     1
7  823.0  .2999     1

$`2`
   Wvlgth Global group
9   337.0  .5632     2
10  337.5  .1245     2
11  338.0  .0012     2
12  821.0  .1987     2
13  822.0  .6743     2
14  823.0  .2045     2

Upvotes: 1

Related Questions