r3vdev

Reputation: 325

Looking for advice on processing variable length text files

I need to write some R code to process tab-delimited text files that have defined segments of variable length.

The table included in each segment starts 1 line below "Group:" and stops either 2 lines above "~End" if the group is a control or 6 lines above "~End" if the group is a standard. The lengths of the tables themselves are variable, and a table can even be empty, like the one in the "SpikedControl" segment.

An example file looks like this:

Group:  Controls    1
Sample  Wells   Values  MeanValue   CV  Od-bkgd-blank
Anti-Hu Det A11 2.849   2.855   0.282   2.853
    A23 2.860           
Coat Control    A12 0.161   0.160   0.530   0.159
    A24 0.160           
Diluent Standard 1  A9  0.114   0.113   1.379   0.104
    A21 0.112           
Diluent Standard 2  A8  0.012   0.013   2.817   0.012
    A20 0.013           

~End
Group:  SpikedControl   1
Sample  Wells   Concentration   Values  MeanValue   CV  ODbkgdblank Conc    %Expected

~End
Group:  Standards   1
Sample  ExpConc Wells   OD  CV OD   ODblank MeanODBlank Result  %Recovery
St001   2000.000    B1  2.939   1.4 2.932   2.904   Range?  Range?
        B13 2.882       2.875       1153.779    57.689
St002   666.667 B2  2.820   0.9 2.812   2.829   456.435 68.465
        B14 2.855       2.847       670.358 100.554
St003   222.222 B3  2.677   0.9 2.669   2.686   208.849 93.982
        B15 2.709       2.702       237.852 107.033
St004   74.074  B4  2.215   1.4 2.205   2.226   72.185  97.449
        B16 2.258       2.248       77.452  104.560
St005   24.691  B5  1.406   1.3 1.397   1.410   24.296  98.397
        B17 1.433       1.424       25.153  101.868
St006   8.230   B6  0.669   2.9 0.658   0.672   7.781   94.536
        B18 0.697       0.686       8.240   100.115
St007   2.743   B7  0.357   5.8 0.348   0.334   3.143   114.579
        B19 0.329       0.320       2.759   100.579
St008   0.914   B8  0.198   3.7 0.191   0.186   1.029   112.551
        B20 0.188       0.181       0.895   97.891
St009   0.305   B9  0.163   7.8 0.154   0.146   0.532   174.477
        B21 0.146       0.137       0.296   97.190
St010   0.102   B10 0.130   5.1 0.123   0.119   0.096   94.087
        B22 0.121       0.114       Range?  Range?
St011   0.034   B11 0.133   4.7 0.126   0.122   0.134   394.778
        B23 0.125       0.117       Range?  Range?
St012   0.011   B12 0.117   0.7 0.105   0.104   Range?  Range?
        B24 0.115       0.104       Range?  Range?

EC50 = 28.085

AUC = 5565.432

~End

I am not very experienced with processing text files like this, and am looking for some advice on how to approach identifying these segments and reading the tables within them.

Thanks!

Edit - Link to example file:

https://www.dropbox.com/s/4m0lmbbequmpd9b/ExampleFile.txt?dl=0

PS: these files are spit out by a spectrophotometer, so I don't have any control over the format; the software is pretty antiquated.

Edit 2 - Making some progress:

Read in the file and get the start and end lines for each segment:

inputtext <- readLines("ExampleFile.txt")

starts <- grep("Group:", inputtext)
ends <- grep("~End", inputtext)

which I can follow up with

test2 <- read.table("ExampleFile.txt", header = T, sep = "\t", skip = 17, nrows = 24, blank.lines.skip = F)

Now I am just trying to figure out how to accurately identify the number of rows to read.

So if the start row is 17 and the end row is 48, then nrows needs to be 24, which is 48 (the end row) - 17 (the rows that are skipped) - 7 (to account for the header line and the lines of fluff at the end of the table; the 7 would instead be 4 for a control table).

Now I just need to figure out how to loop this and properly identify whether the group is a control or standard to subtract the right amount of fluff.
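
Something along these lines might work (an untested sketch; the offsets 7 and 4 come from my counting above and may need adjusting to the real file, and it assumes the standards group is the one whose "Group:" line contains "Standards"):

inputtext <- readLines("ExampleFile.txt")
starts <- grep("Group:", inputtext)
ends   <- grep("~End", inputtext)

tables <- vector("list", length(starts))
for (i in seq_along(starts)) {
  # standards tables carry extra EC50/AUC lines before ~End, so more fluff to trim
  fluff <- if (grepl("Standards", inputtext[starts[i]])) 7 else 4
  n <- ends[i] - starts[i] - fluff
  if (n > 0) {                      # skip empty groups such as SpikedControl
    tables[[i]] <- read.table("ExampleFile.txt",
                              header = TRUE, sep = "\t",
                              skip = starts[i], nrows = n,
                              blank.lines.skip = FALSE)
  }
}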

Upvotes: 1

Views: 46

Answers (2)

Paul Rougieux

Reputation: 11399

You can use a nested data frame for that purpose. I detailed the method in 3 steps below.

library(tidyverse)
inputtext <- readLines("~/downloads/ExampleFile.txt")

1. Read data in a single data frame, create a column with the group name

dtf1 <- data_frame(input = inputtext) %>% 
    # split each line into 9 tab-separated columns
    separate(input, c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"), sep = "\t") %>% 
    # tag the "Group:" lines with their group name, then carry it down to the rows below
    mutate(group = ifelse(grepl("Group:", x1), x2, NA)) %>% 
    fill(group) %>% 
    # keep only header and data rows (lines with fewer than 4 fields get NA in x4)
    filter(!is.na(x4)) 
head(dtf1)

2. Nest the data frame

dtf2 <- dtf1 %>% 
    group_by(group) %>% 
    nest() 
dtf2$data[[1]]

Give column names from the data to the first nested data frame

colnames1 <- dtf2$data[[1]] %>% slice(1) %>% unlist()
colnames1[is.na(colnames1)] <- names(colnames1[is.na(colnames1)])
colnames(dtf2$data[[1]]) <- colnames1

3. Give column names from the data to each sub data frame

dtf3 <- dtf2 %>% 
    mutate(names = map(data, slice, 1),          # the first row of each nested table holds the header
           names = map(names, unlist),
           names = map(names,
                       function(x){ # Replace NA column names by the default x_ names
                           x[is.na(x)] <- names(x[is.na(x)])
                           return(x)
                       }),
           data = map2(data, names, setNames),   # apply the header to each table
           data = map(data, slice, -1))          # drop the header row from the data

You now have a list of data frames. You can use the group name to call the corresponding data frame:

 > dtf3$data[dtf3$group=="Controls"][[1]]
# A tibble: 8 x 9
  Sample             Wells Values MeanValue CV    `Od-bkgd-blank` x7    x8    x9   
  <chr>              <chr> <chr>  <chr>     <chr> <chr>           <chr> <chr> <chr>
1 Anti-Hu Det        A11   2.849  2.855     0.282 2.853           NA    NA    NA   
2 ""                 A23   2.860  ""        ""    ""              NA    NA    NA   
3 Coat Control       A12   0.161  0.160     0.530 0.159           NA    NA    NA   
4 ""                 A24   0.160  ""        ""    ""              NA    NA    NA   
5 Diluent Standard 1 A9    0.114  0.113     1.379 0.104           NA    NA    NA   
6 ""                 A21   0.112  ""        ""    ""              NA    NA    NA   
7 Diluent Standard 2 A8    0.012  0.013     2.817 0.012           NA    NA    NA   
8 ""                 A20   0.013  ""        ""    ""              NA    NA    NA   
> dtf3$data[dtf3$group=="Standards"][[1]]
# A tibble: 24 x 9
   Sample ExpConc  Wells OD    `CV OD` ODblank MeanODBlank Result   `%Recovery`
   <chr>  <chr>    <chr> <chr> <chr>   <chr>   <chr>       <chr>    <chr>      
 1 St001  2000.000 B1    2.939 1.4     2.932   2.904       Range?   Range?     
 2 ""     ""       B13   2.882 ""      2.875   ""          1153.779 57.689     
 3 St002  666.667  B2    2.820 0.9     2.812   2.829       456.435  68.465     
 4 ""     ""       B14   2.855 ""      2.847   ""          670.358  100.554    
 5 St003  222.222  B3    2.677 0.9     2.669   2.686       208.849  93.982     
 6 ""     ""       B15   2.709 ""      2.702   ""          237.852  107.033    
 7 St004  74.074   B4    2.215 1.4     2.205   2.226       72.185   97.449     
 8 ""     ""       B16   2.258 ""      2.248   ""          77.452   104.560    
 9 St005  24.691   B5    1.406 1.3     1.397   1.410       24.296   98.397     
10 ""     ""       B17   1.433 ""      1.424   ""          25.153   101.868    
# ... with 14 more rows

Note: rename based on this answer.
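
Note also that every column in the nested data frames is character (the <chr> types above). If you need numeric columns, one option, just a sketch and not part of the pipeline above, is to re-parse each nested table with readr::type_convert(); columns that contain only numbers (and blanks) become numeric, while columns holding placeholders such as "Range?" stay character:

dtf4 <- dtf3 %>% 
    mutate(data = map(data, readr::type_convert))   # guess and apply column types per table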

I placed the source data in a gist in case it gets lost: spectrophotometer.txt

Upvotes: 0

r3vdev

Reputation: 325

I ended up doing the following:

library(tidyverse)

inputtext <- readLines("Test.txt")

starts <- grep("Group:", inputtext)
ends <- grep("~End", inputtext)

realend <- (ends - starts) - 2   # rows to read for each segment; extra non-data rows are filtered out below

dff <- list()

for (i in seq_along(starts)) {   # one table per segment, however many there are

  dff[[i]] <- read.table("Test.txt",
                   header = T,
                   sep = "\t",
                   skip = starts[i],
                   nrows = realend[i],
                   blank.lines.skip = F,
                   row.names = NULL)

}

dff <- lapply(dff, function(x) {x[!is.na(x$Values),]})   # drop rows where Values is NA (blank and summary lines)

dff <- dff[sapply(dff, function(x) dim(x)[1]) > 0]       # drop empty tables (e.g. SpikedControl)

names(dff) <- letters[1:length(dff)]

list2env(dff, .GlobalEnv)                                # one data frame per table in the global environment
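
As a possible refinement (just a sketch, not what I actually ran): instead of naming the list a, b, c, ... you could name each table after its group, assuming the group name is the second tab-separated field on each "Group:" line. Doing the naming right after the read loop, before dropping the empty tables, keeps the names attached to the right tables:

group_names <- trimws(sapply(strsplit(inputtext[starts], "\t"), `[`, 2))
names(dff) <- group_names                        # name straight after the read loop

dff <- lapply(dff, function(x) {x[!is.na(x$Values),]})
dff <- dff[sapply(dff, nrow) > 0]                # empty groups drop out, names follow

The tables are then available as dff$Controls, dff$Standards, and so on.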

Upvotes: 1
