Reputation: 11662
For whatever reason data is being provided in the following format:
0001 This is text for 0001
0002 This has spaces in between
0003 Yet this is only supposed to be two columns
0009 Why didn't they just comma delimit you may ask?
0010 Or even use quotations?
001 Who knows
0012 But now I'm here with his file
0013 And hoping someone has an elegant solution?
So the above is supposed to be two columns. What I would like to have is a column for the first entries, ie 0001,0002,0003,0009,0010,001,0012,0013
and another column for everything else.
Upvotes: 4
Views: 113
Reputation: 193517
I would recommend the input.file
function from the "iotools" package.
Usage would be something like:
library(iotools)
input.file("yourfile.txt", formatter = dstrsplit, nsep = " ", col_types = "character")
Here's an example. (I've just created a dummy temporary file in my workspace for the purpose of illustration).
x <- tempfile()
writeLines(c("0001 This is text for 0001",
"0002 This has spaces in between",
"0003 Yet this is only supposed to be two columns",
"0009 Why didn't they just comma delimit you may ask?",
"0010 Or even use quotations?",
"001 Who knows",
"0012 But now I'm here with his file",
"0013 And hoping someone has an elegant solution?"), con = x)
library(iotools)
input.file(x, formatter = dstrsplit, nsep = " ", col_types = "character")
# rowindex V1
# 1 0001 This is text for 0001
# 2 0002 This has spaces in between
# 3 0003 Yet this is only supposed to be two columns
# 4 0009 Why didn't they just comma delimit you may ask?
# 5 0010 Or even use quotations?
# 6 001 Who knows
# 7 0012 But now I'm here with his file
# 8 0013 And hoping someone has an elegant solution?
Elegant enough? ;-)
If you've already read the data in as a single-column data.frame
(as in @Jaap's answer), you can still benefit from the extreme speed of the "iotools" package by using the formatter directly, rather than calling it in the input.file
function.
In other words, use:
dstrsplit(as.character(mydf$V1), nsep = " ", col_types = "character")
In case anyone is interested, I benchmarked the solutions proposed by Jaap, and akrun against the "iotools" approach. You can find the results at this Gist. Summary: Whether dealing with a file on disk or a column of a file in memory, "iotoos" is the best performer. I did not test tomtom's solution because it would require further processing from what is already in their answer.
Upvotes: 3
Reputation: 887108
We can use tstrsplit
from data.table
. We convert the 'data.frame' to 'data.table' (setDT(mydf)
), using tstrsplit
on the column "V1", we split by space that follows a number (regex lookaround).
library(data.table)
res <- setDT(mydf)[, tstrsplit(V1, "(?<=\\d)\\s+", perl=TRUE)]
res
# V1 V2
#1: 0001 This is text for 0001
#2: 0002 This has spaces in between
#3: 0003 Yet this is only supposed to be two columns
#4: 0009 Why didnt they just comma delimit you may ask?
#5: 0010 Or even use quotations?
#6: 001 Who knows
#7: 0012 But now Im here with his file
#8: 0013 And hoping someone has an elegant solution?
If needed, the names can be changed with setnames
setnames(res, c("nr", "text"))
Upvotes: 0
Reputation: 83215
You can use the separate
function from the tidyr package for that (promoting my comment to an answer). You specify two column-names and with the extra = "merge"
parameter you make sure that everything after the first space is put into the second column:
library(tidyr)
separate(mydf, V1, c("nr","text"), sep = " ", extra = "merge")
# or:
mydf %>% separate(V1, c("nr","text"), sep = " ", extra = "merge")
you get:
nr text
1 0001 This is text for 0001
2 0002 This has spaces in between
3 0003 Yet this is only supposed to be two columns
4 0009 Why didnt they just comma delimit you may ask?
5 0010 Or even use quotations?
6 001 Who knows
7 0012 But now Im here with his file
8 0013 And hoping someone has an elegant solution?
Used data:
mydf <- structure(list(V1 = structure(c(1L, 2L, 3L, 4L, 6L, 5L, 7L, 8L),
.Label = c("0001 This is text for 0001", "0002 This has spaces in between",
"0003 Yet this is only supposed to be two columns", "0009 Why didnt they just comma delimit you may ask?",
"001 Who knows", "0010 Or even use quotations?", "0012 But now Im here with his file", "0013 And hoping someone has an elegant solution?"), class = "factor")),
.Names = "V1", class = "data.frame", row.names = c(NA,-8L))
Upvotes: 5
Reputation: 259
You may want to use the following (for instance in an lapply loop):
unlist(strsplit(gsub("([0-9]{1,}) ","\\1~",x), "~" ))
What it does is the following: gsub retains anything in between the parentheses (
and )
and stores it in the variable \\1. The [0-9] finds any number, and {1,} right behind it allows one or more occurences. Therefore, you first replace the space between the numbers and the text with a tilde (or anything else that isn't in your text) and then strsplit based on that.
Upvotes: 0