January
January

Reputation: 17140

How to read this file in R

I have the following file:

GroupA Whatever1 A B C
GroupB Whatever2 C D
GroupC Whatever3 E F G H

Columns 1 and 2 should be read as separate columns. The remainder of each line should not be split at tabs and should make the third column. Here is one ugly way of doing this:

foo <- read.table( text="GroupA Whatever1 A B C
GroupB Whatever2 C D
GroupC Whatever3 E F G H", fill=T, header=F)
foo2 <- data.frame( foo$V1, foo$V2, 
  V3=apply( foo[,-c(1,2)], 1, 
    function(x) paste0( x, collapse="\t")))

The result is what I would like to have:

> foo2
  foo.V1    foo.V2         V3
1 GroupA Whatever1  A\tB\tC\t
2 GroupB Whatever2   C\tD\t\t
3 GroupC Whatever3 E\tF\tG\tH

Is there a better way? Preferably one that does not require to first split and then paste again the columns? Some of these lines in the table are very, very long.

Upvotes: 1

Views: 124

Answers (3)

LikosX
LikosX

Reputation: 71

Why not some command-line manipulation of your dataset using awk? (or another language like python, perl, ect)

Here a solution with awk. First you need to know the maximum number of columns of your dataset:

foo=$(awk "BEGIN{max=0} {if(NF>max) max=NF} END{print max}" dataset_file_name)

Now you can use the awk script, passing to it the parameter we have just calculated, and redirecting the output to a new file:

awk -f my_script.awk -v max=$foo dataset_file_name > my_new_dataset

In R you have to specify the separator (a blank space):

bar <- read.table("my_new_dataset", sep=" ")

And you'll get the same result of the example you have posted.


Here's the code of my_script.awk:

{
  printf("%s %s ", $1, $2);
  for(i=3; i<NF; i++) printf("%s/t", $i);
  printf("%s", $NF)
  if(NF<max)
    for(i=1; i<(max=1-NF); i++) printf("\t");
  printf("\n");
}

Another possible solution could be to split the dataset file into two separate files, the first one with the first two columns and the second one with all the others columns. That way maybe you'll find easier to import and manipulate the third column of your dataset.

On the command line:

cut -f -2 dataset_file_name > dataset_columns_1_2
cut -f 3- dataset_file_name > dataset_rest_of_columns

[Note: the separator for cut is tab, if the columns separator is different, use the -d option (for more info cut --help).]

Actually I cannot see any real advantage by using this approach... but I think that the cut command could be useful in some other occasions.

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193677

I would consider a string-splitting function that lets you specify the number of resulting pieces. For instance, you might be able to use stri_split_fixed from the "stringi" package.

Here, I assume you've used readLines to get the text in from your file:

text <- c("GroupA Whatever1 A B C", 
          "GroupB Whatever2 C D", 
          "GroupC Whatever3 E F G H")

library(stringi)

stri_split_fixed(text, " ", 3, simplify = TRUE)
#      [,1]     [,2]        [,3]     
# [1,] "GroupA" "Whatever1" "A B C"  
# [2,] "GroupB" "Whatever2" "C D"    
# [3,] "GroupC" "Whatever3" "E F G H"

From there, if you really wanted to replace spaces with tabs in the last column, that should be a very straightforward gsub operation.

Upvotes: 1

joran
joran

Reputation: 173697

Maybe with tidyr...?

> library(tidyr)
> unite(foo,foo_all,V3:V6,sep = "\t")
      V1        V2    foo_all
1 GroupA Whatever1  A\tB\tC\t
2 GroupB Whatever2   C\tD\t\t
3 GroupC Whatever3 E\tF\tG\tH

Upvotes: 1

Related Questions