Reputation: 1
I have been given a dataset in .txt format that I need to import into R for a research analysis. The .txt file has one column of numeric characters, no header, and no delimiter. Here are a few example lines from the file:
4878578572809275874037093859845083594859474905704627402739385785748756
0934893758795493758745846784678576857458708476968983984980985974687586
3989458476857609379087685796847586770493706759787398499485957658968590
I want to know how to split these characters in R into their variables: the first 3 digits are variable 1, the next 5 digits are variable 2, the next 2 digits are variable 3, and so on.
Is there a function in R that will split each observation in the .txt file according to the required number of characters for each variable?
I have tried read.fwf, but it did not work. Maybe I used it incorrectly.
Help with this issue will be much appreciated!
Upvotes: 0
Views: 1013
Reputation: 2143
library(utils)
raw <- "4878578572809275874037093859845083594859474905704627402739385785748756
0934893758795493758745846784678576857458708476968983984980985974687586
3989458476857609379087685796847586770493706759787398499485957658968590"
# Put your data in a temporary file. You shouldn't have to do this; your data
# is already sitting in a file.
ff <- tempfile()
cat(file = ff, raw)
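# Now read it back in with read.fwf.
# suppressWarnings() hides the "incomplete final line" warning, which appears
# only because cat() wrote the sample without a trailing newline.
answer <- suppressWarnings(
  read.fwf(ff, widths = c(3, 5, 2)))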
# Remember to clean up after ourselves.
unlink(ff) # Again, you won't need to do this; your file isn't temporary.
answer
   V1    V2 V3
1 487 85785 72
2  93 48937 58
3 398 94584 76
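Note that the second row shows 93 rather than 093, because read.fwf converts each field to a number by default. If the leading zeros matter (for example, when a field is a code rather than a quantity), one option is to read every column as character. A minimal sketch, assuming the same three widths; the column names var1 to var3 are placeholders, not names from the question:

```r
# Same sample lines, written to a temporary file for the demonstration.
lines <- c("4878578572809275874037093859845083594859474905704627402739385785748756",
           "0934893758795493758745846784678576857458708476968983984980985974687586",
           "3989458476857609379087685796847586770493706759787398499485957658968590")
ff <- tempfile()
writeLines(lines, ff)

# colClasses = "character" is passed through to read.table and keeps each
# field as text, so "093" stays "093". The column names are placeholders.
answer_chr <- read.fwf(ff, widths = c(3, 5, 2),
                       col.names = c("var1", "var2", "var3"),
                       colClasses = "character")
unlink(ff)
answer_chr
```

Fields that really are quantities can be converted afterwards with as.numeric().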
You can also use a regex (regular expression). I've encoded the field widths you gave in your post:
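library(tidyverse)  # attaches stringr, tibble, dplyr, purrr, tidyr and readr

# Split one line into its fields: 3 digits, 5 digits, 2 digits, then the rest.
byRegx <- function(raw){
  # (?x) turns on free-spacing mode, so the spaces in the pattern are ignored.
  rawSpl <- str_match(raw[1], "(?x) (^\\d{3}) (\\d{5}) (\\d{2}) (.+)")[1,]
  tibble(apples = rawSpl[2], bananas = rawSpl[3], carrots = rawSpl[4],
         therestofthem = rawSpl[5])
}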
Read your input into a table, then apply the byRegx function:
inputTbl <- tibble(
  raw = readr::read_lines("4878578572809275874037093859845083594859474905704627402739385785748756
0934893758795493758745846784678576857458708476968983984980985974687586
3989458476857609379087685796847586770493706759787398499485957658968590")) %>%
  mutate(morecol = map(str_trim(raw), byRegx)) %>%
  unnest(morecol) %>%  # name the list-column; a bare unnest() warns in tidyr >= 1.0
  select(-raw)
inputTbl
# A tibble: 3 x 4
# apples bananas carrots therestofthem
# <chr> <chr> <chr> <chr>
# 1 487 85785 72 809275874037093859845083594859474905704627402739385785748756
# 2 093 48937 58 795493758745846784678576857458708476968983984980985974687586
# 3 398 94584 76 857609379087685796847586770493706759787398499485957658968590
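For readers who would rather avoid the tidyverse dependency, the same pattern can be applied with base R's regexec() and regmatches(). This is only a sketch: it assumes every line matches the pattern, and it reuses the column names from the example above.

```r
lines <- c("4878578572809275874037093859845083594859474905704627402739385785748756",
           "0934893758795493758745846784678576857458708476968983984980985974687586",
           "3989458476857609379087685796847586770493706759787398499485957658968590")

# Same field layout as the pattern above: 3 digits, 5 digits, 2 digits, the rest.
pattern <- "^([0-9]{3})([0-9]{5})([0-9]{2})(.+)$"

# For each line, regmatches() returns the full match followed by each capture group.
matches <- regmatches(lines, regexec(pattern, lines))

# Drop the full match (element 1) and stack the groups into one row per line.
# Assumes every line matched; a non-matching line would yield an empty vector.
fields <- do.call(rbind, lapply(matches, function(m) m[-1]))
colnames(fields) <- c("apples", "bananas", "carrots", "therestofthem")
result <- as.data.frame(fields, stringsAsFactors = FALSE)
result
```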
Upvotes: 1
Reputation: 2949
You can use the substring function.
txt <- "487857857280927587403709385984508359485947490570462740273938578574875609348937587954937587458467846785768574587084769689839849809859746875863989458476857609379087685796847586770493706759787398499485957658968590"
select <- c(3, 5, 2) # widths (number of characters) of each field
# Start positions are 1, 4, 9 and end positions are 3, 8, 10.
out <- as.numeric(substring(txt,
                            cumsum(c(1, select[-length(select)])),
                            cumsum(select)))
out
#[1] 487 85785 72
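The snippet above extracts the fields of the first record only. The same start/end arithmetic can be applied to every line of a file, since substring() accepts vectors of start and end positions. A rough sketch, assuming the lines have already been read with readLines(); split_fixed is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: split each string in `x` into fields of the given widths.
split_fixed <- function(x, widths) {
  ends   <- cumsum(widths)
  starts <- ends - widths + 1
  # vapply() returns one column per line, so transpose to one row per line.
  fields <- t(vapply(x, substring, character(length(widths)),
                     first = starts, last = ends, USE.NAMES = FALSE))
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# In practice: lines <- readLines("yourfile.txt")  (the file name is a placeholder)
lines <- c("4878578572809275874037093859845083594859474905704627402739385785748756",
           "0934893758795493758745846784678576857458708476968983984980985974687586",
           "3989458476857609379087685796847586770493706759787398499485957658968590")
res <- split_fixed(lines, c(3, 5, 2))
res
```

The fields are kept as character so leading zeros survive; wrap a column in as.numeric() where a number is wanted.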
Upvotes: 0