Reputation: 17140
I have the following file:
GroupA Whatever1 A B C
GroupB Whatever2 C D
GroupC Whatever3 E F G H
Columns 1 and 2 should be read as separate columns. The remainder of each line should not be split at tabs and should make the third column. Here is one ugly way of doing this:
foo <- read.table( text="GroupA Whatever1 A B C
GroupB Whatever2 C D
GroupC Whatever3 E F G H", fill=T, header=F)
foo2 <- data.frame( foo$V1, foo$V2,
V3=apply( foo[,-c(1,2)], 1,
function(x) paste0( x, collapse="\t")))
The result is what I would like to have:
> foo2
foo.V1 foo.V2 V3
1 GroupA Whatever1 A\tB\tC\t
2 GroupB Whatever2 C\tD\t\t
3 GroupC Whatever3 E\tF\tG\tH
Is there a better way? Preferably one that does not require to first split and then paste again the columns? Some of these lines in the table are very, very long.
Upvotes: 1
Views: 124
Reputation: 71
Why not some command-line manipulation of your dataset using awk
? (or another language like python, perl, ect)
Here a solution with awk
. First you need to know the maximum number of columns of your dataset:
foo=$(awk "BEGIN{max=0} {if(NF>max) max=NF} END{print max}" dataset_file_name)
Now you can use the awk script, passing to it the parameter we have just calculated, and redirecting the output to a new file:
awk -f my_script.awk -v max=$foo dataset_file_name > my_new_dataset
In R you have to specify the separator (a blank space):
bar <- read.table("my_new_dataset", sep=" ")
And you'll get the same result of the example you have posted.
Here's the code of my_script.awk:
{
printf("%s %s ", $1, $2);
for(i=3; i<NF; i++) printf("%s/t", $i);
printf("%s", $NF)
if(NF<max)
for(i=1; i<(max=1-NF); i++) printf("\t");
printf("\n");
}
Another possible solution could be to split the dataset file into two separate files, the first one with the first two columns and the second one with all the others columns. That way maybe you'll find easier to import and manipulate the third column of your dataset.
On the command line:
cut -f -2 dataset_file_name > dataset_columns_1_2
cut -f 3- dataset_file_name > dataset_rest_of_columns
[Note: the separator for cut
is tab, if the columns separator is different, use the -d
option (for more info cut --help
).]
Actually I cannot see any real advantage by using this approach... but I think that the cut
command could be useful in some other occasions.
Upvotes: 0
Reputation: 193677
I would consider a string-splitting function that lets you specify the number of resulting pieces. For instance, you might be able to use stri_split_fixed
from the "stringi" package.
Here, I assume you've used readLines
to get the text in from your file:
text <- c("GroupA Whatever1 A B C",
"GroupB Whatever2 C D",
"GroupC Whatever3 E F G H")
library(stringi)
stri_split_fixed(text, " ", 3, simplify = TRUE)
# [,1] [,2] [,3]
# [1,] "GroupA" "Whatever1" "A B C"
# [2,] "GroupB" "Whatever2" "C D"
# [3,] "GroupC" "Whatever3" "E F G H"
From there, if you really wanted to replace spaces with tabs in the last column, that should be a very straightforward gsub
operation.
Upvotes: 1
Reputation: 173697
Maybe with tidyr...?
> library(tidyr)
> unite(foo,foo_all,V3:V6,sep = "\t")
V1 V2 foo_all
1 GroupA Whatever1 A\tB\tC\t
2 GroupB Whatever2 C\tD\t\t
3 GroupC Whatever3 E\tF\tG\tH
Upvotes: 1