android16.5
android16.5

Reputation: 35

Creating a dataframe by concatenating substrings of non-uniform lengths

The raw data (.txt) file has 65926 elements, each containing a 142 character string.

Here's the dropbox link to the raw data file.

The task is to divide that 142 character string into 37 smaller strings (each substring being a separate character variable), each one beginning sequentially at following positions: 1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,75,76,77,78,79,80,127,130,133

The final output must be a 65926 x 37 dataframe. Here's a snapshot of the output dataframe:

Here's a snapshot of the output dataframe.

Here's the code I'm using:

x <- readLines("R71252L01.TXT")
a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,75,76,77,78,79,80,127,130,133)
z <- data.frame(matrix(nrow = length(x), ncol = length(a)), stringsAsFactors = FALSE)
for (i in 1:length(x) ) {
z[i,] <- (list(
            (c(substr(x[i], 1, 3),substr(x[i], 4, 8),substr(x[i], 9, 10),
               substr(x[i], 11, 13),substr(x[i], 14, 14),substr(x[i], 15, 15),
               substr(x[i], 16, 18),substr(x[i], 19, 20),substr(x[i], 21, 23),
               substr(x[i], 24, 25),substr(x[i], 26, 26),substr(x[i], 27, 27),
               substr(x[i], 28, 31),substr(x[i], 32, 32),substr(x[i], 33, 33),
               substr(x[i], 34, 35),substr(x[i], 36, 37),substr(x[i], 38, 42),
               substr(x[i], 43, 44),substr(x[i], 45, 45),substr(x[i], 46, 46),
               substr(x[i], 47, 47),substr(x[i], 48, 51),substr(x[i], 52, 55),
               substr(x[i], 56, 56),substr(x[i], 60, 65),substr(x[i], 66, 71),
               substr(x[i], 72, 74),substr(x[i], 75, 75),substr(x[i], 76, 76),
               substr(x[i], 77, 77),substr(x[i], 78, 78),substr(x[i], 79, 79),
               substr(x[i], 80, 126),substr(x[i], 127, 129),substr(x[i], 130, 132),
               substr(x[i], 133, 142)
            ) )
        )    )
i <- i+1
}

The code works but there are two issues:

  1. The start and stop indices for substr() have to be manually typed out. Any way to utilize the vector a instead of all that manual labour?
  2. The code takes over 30 minutes to execute. Checking time:

    > system.time(source('Hitesh_Script.R'))
        user   system  elapsed 
    4452.464    9.440 4476.018 
    

    Can this be done faster?

I have to perform this task for several raw data files, each with a different vector a. Hence any other suggestions for efficiency will also be appreciated. Thanks a ton!

Upvotes: 1

Views: 146

Answers (2)

Uwe
Uwe

Reputation: 42544

It seems that the readr package (part of Hadley's tidyverse) offers an even faster solution to read and split the fixed width file in one rush.

The given sample file on dropbox took 0.17 sec elapsed time on my system to read in and return the 65,926 × 37 data.frame.

library(readr)

a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,
       75,76,77,78,79,80,127,130,133)

z <- read_fwf("R71252L01.TXT", fwf_widths(diff(c(a, 142))),
              col_types = stringr::str_dup("c", length(a)))

Output is a tibble, an improved data.frame:

print(z, n = 3, width = Inf)

# A tibble: 65,926 × 37
     X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1   000 37773    71   252     1     1   012    05   005    01
2   000 37773    71   252     1     1   012    05   005    01
3   000 37773    71   252     1     1   012    05   005    01
    X11   X12   X13   X14   X15   X16   X17   X18   X19   X20
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     1     2  0110     1     2    01    01 00000    01     1
2     1     2  0110     1     2    02    01 00000    01     1
3     1     2  0110     1     2    03    01 00000    01     1
    X21   X22   X23   X24   X25    X26    X27   X28   X29   X30
  <chr> <chr> <chr> <chr> <chr>  <chr>  <chr> <chr> <chr> <chr>
1     1  <NA>  6538  4001  <NA> 120314 310314   115     2     2
2     1  <NA>  6538  4001  <NA> 120314 310314    90     2     2
3     1  <NA>  6538  4001  <NA> 130314 310314    90     2     2
    X31   X32   X33   X34   X35   X36   X37
  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     2     2     2  <NA>     1     2 21433
2     2     2     2  <NA>     1     2 21433
3     2     2     2  <NA>     1     2 21433

Explanations

  • You can specify either field widths or start and end positions of fields. Computing widths from the given start positions a using diff required less coding. However, the end position (142) has to be specified any way.

  • To stay in line with the Q, I forced all columns to be of type character with the parameter col_types = stringr::str_dup("c", length(a)). If the OP needs the columns to be of other types, this can be specified as required or rely on the built-in type recognition, see help("read_fwf").

  • I've also tried read.fwf() first but this is magnitudes slower (32.7 sec elapsed time).

Upvotes: 0

joel.wilson
joel.wilson

Reputation: 8413

I just simulated a scratch data and shared a simple code

a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,
       56,60,66,72,75,76,77,78,79,80,127,130,133)
df = data.frame(
  x = c("uiagdsjgcjkh bijacydgasxdhsfkajdh,cnfwkeyrg,urnyhvguirwljbhgkjgjgdkgkdgkgdkgdkgdkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,ynh lgdiyl", 
        "kjhfkjsdlfkojjhgckjasnhjhckjsybsanhdsabtgchbtsjahasijhcndkuysefiuwyhsnidxjnkausetfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwyftacbm"))

# > df
# x
# 1 uiagdsjgcjkh bijacydgasxdhsfkajdh,cnfwkeyrg,urnyhvguirwljbhgkjgjgdkgkdgkgdkgdkgdkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,ynh lgdiyl
# 2 kjhfkjsdlfkojjhgckjasnhjhckjsybsanhdsabtgchbtsjahasijhcndkuysefiuwyhsnidxjnkausetfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwyftacbm

df1 <- data.frame(lapply(df, function(x) lapply(seq_along(a), function(i){
                                              if (i==length(a))
                                                substr(x,a[i],nchar(as.character(x)))
                                              else
                                                substr(x,a[i],a[i+1]-1)}
                                        )))
colnames(df1)=paste0("x",1:dim(df1)[2])

df1
#    x1    x2 x3  x4 x5 x6  x7 x8  x9 x10 x11 x12  x13 x14 x15 x16 x17   x18 x19 x20 x21
# 1 uia gdsjg cj kh   b  i jac yd gas  xd   h   s fkaj   d   h  ,c  nf wkeyr  g,   u   r
# 2 kjh fkjsd lf koj  j  h gck ja snh  jh   c   k jsyb   s   a  nh  ds abtgc  hb   t   s

#   x22  x23  x24  x25    x26    x27 x28 x29 x30 x31 x32 x33
# 1   n yhvg uirw ljbh gkjgjg dkgkdg kgd   k   g   d   k   g
# 2   j ahas ijhc ndku ysefiu wyhsni dxj   n   k   a   u   s

#                                               x34 x35 x36 x37
# 1 dkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,y nh  lgd iyl
# 2 etfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwy fta cbm    

Upvotes: 3

Related Questions