Reputation: 35
The raw data (.txt) file has 65926 elements, each containing a 142 character string.
Here's the dropbox link to the raw data file.
The task is to divide that 142 character string into 37 smaller strings (each substring being a separate character variable), each one beginning sequentially at following positions: 1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,75,76,77,78,79,80,127,130,133
The final output must be a 65926 x 37 dataframe. Here's a snapshot of the output dataframe:
Here's the code I'm using:
x <- readLines("R71252L01.TXT")
a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,75,76,77,78,79,80,127,130,133)
z <- data.frame(matrix(nrow = length(x), ncol = length(a)), stringsAsFactors = FALSE)
for (i in 1:length(x) ) {
z[i,] <- (list(
(c(substr(x[i], 1, 3),substr(x[i], 4, 8),substr(x[i], 9, 10),
substr(x[i], 11, 13),substr(x[i], 14, 14),substr(x[i], 15, 15),
substr(x[i], 16, 18),substr(x[i], 19, 20),substr(x[i], 21, 23),
substr(x[i], 24, 25),substr(x[i], 26, 26),substr(x[i], 27, 27),
substr(x[i], 28, 31),substr(x[i], 32, 32),substr(x[i], 33, 33),
substr(x[i], 34, 35),substr(x[i], 36, 37),substr(x[i], 38, 42),
substr(x[i], 43, 44),substr(x[i], 45, 45),substr(x[i], 46, 46),
substr(x[i], 47, 47),substr(x[i], 48, 51),substr(x[i], 52, 55),
substr(x[i], 56, 56),substr(x[i], 60, 65),substr(x[i], 66, 71),
substr(x[i], 72, 74),substr(x[i], 75, 75),substr(x[i], 76, 76),
substr(x[i], 77, 77),substr(x[i], 78, 78),substr(x[i], 79, 79),
substr(x[i], 80, 126),substr(x[i], 127, 129),substr(x[i], 130, 132),
substr(x[i], 133, 142)
) )
) )
i <- i+1
}
The code works but there are two issues:
substr()
have to be manually typed out. Any way to utilize the vector a
instead of all that manual labour?The code takes over 30 minutes to execute. Checking time:
> system.time(source('Hitesh_Script.R'))
user system elapsed
4452.464 9.440 4476.018
Can this be done faster?
I have to perform this task for several raw data files, each with a different vector a
. Hence any other suggestions for efficiency will also be appreciated. Thanks a ton!
Upvotes: 1
Views: 146
Reputation: 42544
It seems that the readr
package (part of Hadley's tidyverse) offers an even faster solution to read and split the fixed width file in one rush.
The given sample file on dropbox took 0.17 sec elapsed time on my system to read in and return the 65,926 × 37 data.frame.
library(readr)
a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,56,60,66,72,
75,76,77,78,79,80,127,130,133)
z <- read_fwf("R71252L01.TXT", fwf_widths(diff(c(a, 142))),
col_types = stringr::str_dup("c", length(a)))
Output is a tibble, an improved data.frame
:
print(z, n = 3, width = Inf)
# A tibble: 65,926 × 37
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 000 37773 71 252 1 1 012 05 005 01
2 000 37773 71 252 1 1 012 05 005 01
3 000 37773 71 252 1 1 012 05 005 01
X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2 0110 1 2 01 01 00000 01 1
2 1 2 0110 1 2 02 01 00000 01 1
3 1 2 0110 1 2 03 01 00000 01 1
X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 <NA> 6538 4001 <NA> 120314 310314 115 2 2
2 1 <NA> 6538 4001 <NA> 120314 310314 90 2 2
3 1 <NA> 6538 4001 <NA> 130314 310314 90 2 2
X31 X32 X33 X34 X35 X36 X37
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2 2 2 <NA> 1 2 21433
2 2 2 2 <NA> 1 2 21433
3 2 2 2 <NA> 1 2 21433
You can specify either field widths or start and end positions of fields. Computing widths from the given start positions a
using diff
required less coding. However, the end position (142) has to be specified any way.
To stay in line with the Q, I forced all columns to be of type character
with the parameter col_types = stringr::str_dup("c", length(a))
. If the OP needs the columns to be of other types, this can be specified as required or rely on the built-in type recognition, see help("read_fwf")
.
I've also tried read.fwf()
first but this is magnitudes slower (32.7 sec elapsed time).
Upvotes: 0
Reputation: 8413
I just simulated a scratch data and shared a simple code
a <- c(1,4,9,11,14,15,16,19,21,24,26,27,28,32,33,34,36,38,43,45,46,47,48,52,
56,60,66,72,75,76,77,78,79,80,127,130,133)
df = data.frame(
x = c("uiagdsjgcjkh bijacydgasxdhsfkajdh,cnfwkeyrg,urnyhvguirwljbhgkjgjgdkgkdgkgdkgdkgdkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,ynh lgdiyl",
"kjhfkjsdlfkojjhgckjasnhjhckjsybsanhdsabtgchbtsjahasijhcndkuysefiuwyhsnidxjnkausetfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwyftacbm"))
# > df
# x
# 1 uiagdsjgcjkh bijacydgasxdhsfkajdh,cnfwkeyrg,urnyhvguirwljbhgkjgjgdkgkdgkgdkgdkgdkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,ynh lgdiyl
# 2 kjhfkjsdlfkojjhgckjasnhjhckjsybsanhdsabtgchbtsjahasijhcndkuysefiuwyhsnidxjnkausetfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwyftacbm
df1 <- data.frame(lapply(df, function(x) lapply(seq_along(a), function(i){
if (i==length(a))
substr(x,a[i],nchar(as.character(x)))
else
substr(x,a[i],a[i+1]-1)}
)))
colnames(df1)=paste0("x",1:dim(df1)[2])
df1
# x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21
# 1 uia gdsjg cj kh b i jac yd gas xd h s fkaj d h ,c nf wkeyr g, u r
# 2 kjh fkjsd lf koj j h gck ja snh jh c k jsyb s a nh ds abtgc hb t s
# x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33
# 1 n yhvg uirw ljbh gkjgjg dkgkdg kgd k g d k g
# 2 j ahas ijhc ndku ysefiu wyhsni dxj n k a u s
# x34 x35 x36 x37
# 1 dkgdkgdkgdkgdkdgkdgkdgkdgkjdgkdwjbiuayeiuy ke,y nh lgd iyl
# 2 etfba jwf,mycaiusftbbawubbctejdgkjdghjgdduiacwy fta cbm
Upvotes: 3