LearneR

Reputation: 2531

How to remove the [1]s, [[1]]s and double quotes from CSV data in R?

I've a CSV file. It contains the output of some previous R operations, so it is filled with the index numbers (such as [1], [[1]]). When it is read into R, it looks like this, for example:

        V1
1                                                                                                           [1] 789
2                                                                                                             [[1]]
3                                                           [1] "PNG"        "D115"    "DX06"    "Slz"
4                                                                                                           [1] 787
5                                                                                                             [[1]]
6                                                                       [1] "D010"           "HC"
7                                                                                                           [1] 949
8                                                                                                             [[1]]
9                                                                       [1] "HC" "DX06"          

(I don't know why there is all that wasted space between the row number and the output data.)

I need the above data to appear as follows (without the [1], [[1]] or double quotes, and with each set of values placed beside its corresponding number):

789 PNG,D115,DX06,Slz
787 D010,HC
949 HC,DX06

(Ideally, 789 and its corresponding data PNG,D115,DX06,Slz would be separated by a tab, and likewise for each row.)

How can I achieve this in R?

Upvotes: 1

Views: 64

Answers (2)

smci

Reputation: 33938

Honestly, a command-line fix using sed, perl, or egrep -o is less painful:

sed -e 's/.*\][ \t]*//' dirty.csv > clean.csv 
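On its own, that sed expression only strips the leading [1] / [[1]] index from each line; the quotes remain and each number still sits on a different line from its values. A minimal sketch of how the pipeline might be extended to produce the tab-separated layout asked for is shown below. It assumes the file holds exactly one printed R line per row, as in the question, and that a line consisting of a single number is always an id; the sample input and the output name clean.tsv are made up for illustration.

```shell
# Hypothetical sample input mirroring the rows shown in the question
cat > dirty.csv <<'EOF'
[1] 789
[[1]]
[1] "PNG"        "D115"    "DX06"    "Slz"
[1] 787
[[1]]
[1] "D010"           "HC"
[1] 949
[[1]]
[1] "HC" "DX06"
EOF

# sed: drop everything up to the last ']' plus any following whitespace,
#      then drop all double quotes ([[1]] lines become empty).
# awk: skip empty lines; remember a lone number as the current id
#      (assumption: ids are the only single-number lines); for any other
#      line, join its fields with commas and print it next to the id.
sed -e 's/.*\][[:space:]]*//' -e 's/"//g' dirty.csv |
awk 'NF == 0 { next }
     NF == 1 && $1 ~ /^[0-9]+$/ { id = $1; next }
     { vals = $1
       for (i = 2; i <= NF; i++) vals = vals "," $i
       print id "\t" vals }' > clean.tsv
```

With the sample input, clean.tsv contains three lines: 789, 787 and 949, each followed by a tab and its comma-separated values. If the real file carries extra CSV quoting or a header row, the sed step would need adjusting.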

Upvotes: 1

akrun

Reputation: 887108

We could create a grouping variable ('indx') and split the 'V1' column by it, after removing the bracketed index at the beginning of each string as well as the double quotes within it. Assuming that the first column should hold the numeric element and the second column the non-numeric part, we can use regex to replace the whitespace between the values with ',' (as shown in the expected result) and then rbind the list elements.

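 # a new group starts on the row just before each '[[1]]'
 indx <- cumsum(c(grepl('\\[\\[', df1$V1)[-1], FALSE))
 # strip the quotes and the leading '[1]' / '[[1]]' index, then split by group
 do.call(rbind, lapply(split(gsub('"|^.*\\]', '', df1$V1), indx),
     function(x) data.frame(ind = gsub('^\\s+|\\s+$', '', x[1]),
         val = gsub('\\s+', ',', gsub('^\\s+|\\s+$', '', x[-1][x[-1] != ''])))))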

 #   ind               val
 #1  789 PNG,D115,DX06,Slz
 #2  787           D010,HC
 #3  949           HC,DX06

data

 df1 <- structure(list(V1 = c("[1] 789", "[[1]]", 
 "[1] \"PNG\"        \"D115\"    \"DX06\"    \"Slz\"", 
 "[1] 787", "[[1]]", "[1] \"D010\"           \"HC\"", "[1] 949", 
 "[[1]]", "[1] \"HC\" \"DX06\"")), .Names = "V1", 
 class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", 
 "7", "8", "9"))

Upvotes: 3
