Cenoc
Cenoc

Reputation: 11662

Processing a variable space delimited file limited into 2 columns

For whatever reason data is being provided in the following format:

0001 This is text for 0001
0002 This has spaces in between
0003 Yet this is only supposed to be two columns
0009 Why didn't they just comma delimit you may ask?
0010 Or even use quotations?
001  Who knows
0012 But now I'm here with his file
0013 And hoping someone has an elegant solution?

So the above is supposed to be two columns. What I would like to have is a column for the first entries, ie 0001,0002,0003,0009,0010,001,0012,0013 and another column for everything else.

Upvotes: 4

Views: 113

Answers (4)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

I would recommend the input.file function from the "iotools" package.

Usage would be something like:

library(iotools)
input.file("yourfile.txt", formatter = dstrsplit, nsep = " ", col_types = "character")

Here's an example. (I've just created a dummy temporary file in my workspace for the purpose of illustration).

x <- tempfile()
writeLines(c("0001 This is text for 0001",
             "0002 This has spaces in between",
             "0003 Yet this is only supposed to be two columns",
             "0009 Why didn't they just comma delimit you may ask?",
             "0010 Or even use quotations?",
             "001  Who knows",
             "0012 But now I'm here with his file",
             "0013 And hoping someone has an elegant solution?"), con = x)

library(iotools)
input.file(x, formatter = dstrsplit, nsep = " ", col_types = "character")
#   rowindex                                              V1
# 1     0001                           This is text for 0001
# 2     0002                      This has spaces in between
# 3     0003     Yet this is only supposed to be two columns
# 4     0009 Why didn't they just comma delimit you may ask?
# 5     0010                         Or even use quotations?
# 6      001                                       Who knows
# 7     0012                  But now I'm here with his file
# 8     0013     And hoping someone has an elegant solution?

Elegant enough? ;-)


Update 1

If you've already read the data in as a single-column data.frame (as in @Jaap's answer), you can still benefit from the extreme speed of the "iotools" package by using the formatter directly, rather than calling it in the input.file function.

In other words, use:

dstrsplit(as.character(mydf$V1), nsep = " ", col_types = "character")

Update 2

In case anyone is interested, I benchmarked the solutions proposed by Jaap, and akrun against the "iotools" approach. You can find the results at this Gist. Summary: Whether dealing with a file on disk or a column of a file in memory, "iotoos" is the best performer. I did not test tomtom's solution because it would require further processing from what is already in their answer.

Upvotes: 3

akrun
akrun

Reputation: 887108

We can use tstrsplit from data.table. We convert the 'data.frame' to 'data.table' (setDT(mydf)), using tstrsplit on the column "V1", we split by space that follows a number (regex lookaround).

library(data.table)
res <- setDT(mydf)[, tstrsplit(V1, "(?<=\\d)\\s+", perl=TRUE)]
res
#     V1                                             V2
#1: 0001                          This is text for 0001
#2: 0002                     This has spaces in between
#3: 0003    Yet this is only supposed to be two columns
#4: 0009 Why didnt they just comma delimit you may ask?
#5: 0010                        Or even use quotations?
#6:  001                                      Who knows
#7: 0012                  But now Im here with his file
#8: 0013    And hoping someone has an elegant solution?

If needed, the names can be changed with setnames

setnames(res, c("nr", "text"))

Upvotes: 0

Jaap
Jaap

Reputation: 83215

You can use the separate function from the tidyr package for that (promoting my comment to an answer). You specify two column-names and with the extra = "merge" parameter you make sure that everything after the first space is put into the second column:

library(tidyr)
separate(mydf, V1, c("nr","text"), sep = " ", extra = "merge")
# or:
mydf %>% separate(V1, c("nr","text"), sep = " ", extra = "merge")

you get:

    nr                                           text
1 0001                          This is text for 0001
2 0002                     This has spaces in between
3 0003    Yet this is only supposed to be two columns
4 0009 Why didnt they just comma delimit you may ask?
5 0010                        Or even use quotations?
6  001                                      Who knows
7 0012                  But now Im here with his file
8 0013    And hoping someone has an elegant solution?

Used data:

mydf <- structure(list(V1 = structure(c(1L, 2L, 3L, 4L, 6L, 5L, 7L, 8L), 
                                      .Label = c("0001 This is text for 0001", "0002 This has spaces in between",
                                                 "0003 Yet this is only supposed to be two columns", "0009 Why didnt they just comma delimit you may ask?", 
                                                 "001  Who knows", "0010 Or even use quotations?", "0012 But now Im here with his file", "0013 And hoping someone has an elegant solution?"), class = "factor")), 
              .Names = "V1", class = "data.frame", row.names = c(NA,-8L))

Upvotes: 5

tomtom
tomtom

Reputation: 259

You may want to use the following (for instance in an lapply loop):

unlist(strsplit(gsub("([0-9]{1,}) ","\\1~",x), "~" ))

What it does is the following: gsub retains anything in between the parentheses ( and ) and stores it in the variable \\1. The [0-9] finds any number, and {1,} right behind it allows one or more occurences. Therefore, you first replace the space between the numbers and the text with a tilde (or anything else that isn't in your text) and then strsplit based on that.

Upvotes: 0

Related Questions