Reputation: 131
I have a CSV file in this format:
android ; login.html , connect.json , page1.json
windows ; login.html , connect.json , page1.json , page2.html , page5.html
windows ; login.html , connect.json , page4.json
To run a PCA (multivariate analysis) on these variables, they must be numeric, like this:
1 ; 3
0 ; 5
0 ; 3
The first value is 1 for android and 0 for windows, and the second is the number of pages. I am looking for a way to convert this non-numeric data. Any ideas, please?
Upvotes: 0
Views: 1092
Reputation: 269501
Try strsplit and lengths:
DF <- read.table(text = Lines, sep = ";", as.is = TRUE, strip.white = TRUE)
transform(DF, V1 = as.numeric(V1 == "android"), V2 = lengths(strsplit(V2, ",")))
giving:
V1 V2
1 1 3
2 0 5
3 0 3
Note: We used this input:
Lines <- "android ; login.html , connect.json , page1.json
windows ; login.html , connect.json , page1.json , page2.html , page5.html
windows ; login.html , connect.json , page4.json"
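To see why this works, here is a minimal sketch that spells out the intermediate steps on the same sample input. The names pages, n_pages, is_android and result are only illustrative and do not appear in the answer above.

```r
# Same sample input as in the note above
Lines <- "android ; login.html , connect.json , page1.json
windows ; login.html , connect.json , page1.json , page2.html , page5.html
windows ; login.html , connect.json , page4.json"

# Split each line on ";" into two character columns, V1 and V2
DF <- read.table(text = Lines, sep = ";", as.is = TRUE, strip.white = TRUE)

# strsplit() returns a list with one character vector of pages per row
pages <- strsplit(DF$V2, ",")

# lengths() counts the elements of each list entry, i.e. pages per row
n_pages <- lengths(pages)

# The logical comparison becomes 1 for android and 0 otherwise
is_android <- as.numeric(DF$V1 == "android")

result <- data.frame(V1 = is_android, V2 = n_pages)
print(result)
```

Note that lengths requires R 3.2.0 or later; on older versions sapply(pages, length) should give the same counts.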
Upvotes: 1
Reputation: 193517
Here's one approach:
data.frame(V1 = as.numeric(mydf$V1 == "android"),
           V2 = count.fields(textConnection(mydf$V2), sep = ","))
# V1 V2
# 1 1 3
# 2 0 5
# 3 0 3
Sample data:
mydf <- read.table(
header = FALSE, sep = ";", stringsAsFactors = FALSE, strip.white = TRUE,
text = '"android" ; "login.html , connect.json , page1.json"
"windows" ; "login.html , connect.json , page1.json , page2.html , page5.html"
"windows" ; "login.html , connect.json , page4.json"')
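Once both columns are numeric, the result can be handed to a PCA routine. The sketch below assumes base R's prcomp is the function you have in mind (the question does not say which one is used) and uses the hypothetical names num and pca. With only three rows it is a smoke test, not a meaningful analysis.

```r
# Sample data as above, rebuilt here so the sketch runs on its own
mydf <- read.table(
  header = FALSE, sep = ";", stringsAsFactors = FALSE, strip.white = TRUE,
  text = '"android" ; "login.html , connect.json , page1.json"
"windows" ; "login.html , connect.json , page1.json , page2.html , page5.html"
"windows" ; "login.html , connect.json , page4.json"')

# Numeric encoding: platform flag and number of pages per row
num <- data.frame(V1 = as.numeric(mydf$V1 == "android"),
                  V2 = count.fields(textConnection(mydf$V2), sep = ","))

# Assumption: prcomp() is the intended PCA function; scale. = TRUE
# standardises both columns before the decomposition
pca <- prcomp(num, scale. = TRUE)
summary(pca)
```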
Upvotes: 2