I have a factor with 1000 rows and 848 levels (i.e. some rows are empty). For each row, I want to count the number of elements (i.e., one element = 1, 2 elements = 2, empty row = 0, etc.). A simpler way to describe it is: I want to convert a factor into a data.frame, but I want to change the data type from factor to numeric and keep the values in each row.
v.m.two <- Output[,1]
v.m.two <- data.frame(v.m.two)
class(v.m.two)
[1] data.frame
class(v.m.two[1,])
[1] factor
dim(v.m.two)
[1] 1000 1
v.m.two[1,]
[1] 848 Levels: 0 1000 1002, 4875, 4082, 1952 1015, 2570, 3524 1017 1020, 1576 ... 983, 4381,
2256, 4361, 4271
Any suggestions?
v.m.two
1 2633, 4868
2 126, 4860
3 0
4 122, 4762
5 4256
6 2933, 2892, 2389
Basically, I want to count the values in each row (e.g., row 1 is 2, row 2 is 2, row 3 is 0, etc.).
Upvotes: 0
Views: 216
Reputation: 99371
You have erroneous commas, which are causing the column to be read as a factor. Try scan:
scan(text=with(v.m.two, levels(v.m.two)[v.m.two]), sep=",", what=integer())
# Read 11 items
# [1] 2633 4868 126 4860 0 122 4762 4256 2933 2892 2389
And to count the lengths and convert to numeric, you can also use strsplit
s <- strsplit(as.character(v.m.two[[1]]), ", ")
vapply(s, length, integer(1L)) ## row 3 is actually 1 if there's a zero there
# [1] 2 2 1 2 1 3
as.numeric(do.call(c, s))
# [1] 2633 4868 126 4860 0 122 4762 4256 2933 2892 2389
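If the 0 entries are only placeholders for empty rows, the strsplit approach can be adapted to count them as zero. This is a minimal base R sketch, assuming the six-row sample from the question is representative and that 0 never occurs as a real value:

```r
# Sample data rebuilt from the question (assumed to be representative)
v.m.two <- data.frame(
  v.m.two = c("2633, 4868", "126, 4860", "0", "122, 4762", "4256",
              "2933, 2892, 2389"),
  stringsAsFactors = TRUE
)

# Split each row on commas, allowing optional surrounding spaces
s <- strsplit(as.character(v.m.two[[1]]), "\\s*,\\s*")

# Convert the pieces to numeric, keeping one vector per row
vals <- lapply(s, as.numeric)

# Count the nonzero values per row, so a row holding only 0 counts as 0
counts <- vapply(vals, function(x) sum(x != 0), integer(1L))
counts
# [1] 2 2 0 2 1 3
```

Keeping vals as a list preserves which values belong to which row, in case that is needed later.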
Upvotes: 1
Reputation: 887901
1. Converting factor to numeric
If you want to convert the factor column to numeric and have separate columns based on the number of elements in each row, you could use cSplit from splitstackshape.
library(splitstackshape)
res <- cSplit(v.m.two, 'v.m.two', sep=",")
res
# v.m.two_1 v.m.two_2 v.m.two_3
#1: 2633 4868 NA
#2: 126 4860 NA
#3: 0 NA NA
#4: 122 4762 NA
#5: 4256 NA NA
#6: 2933 2892 2389
str(res)
# Classes 'data.table' and 'data.frame': 6 obs. of 3 variables:
# $ v.m.two_1: int 2633 126 0 122 4256 2933
# $ v.m.two_2: int 4868 4860 NA 4762 NA 2892
# $ v.m.two_3: int NA NA NA NA NA 2389
If you need a vector, you could use stri_split from stringi.
library(stringi)
as.numeric(unlist(stri_split(v.m.two[,1], regex=",")))
#[1] 2633 4868 126 4860 0 122 4762 4256 2933 2892 2389
2. Counting values in row
For counting the values in each row of v.m.two, you could either count from the res above or from v.m.two. In the first option, we count the number of non-NA values in each row of res and then multiply by the logical index derived from whether the first column of v.m.two is 0 or not. The TRUE values (i.e. != 0) keep the count, while the FALSE values coerce to 0 (i.e. 0 * value = 0).
(v.m.two[,1]!=0)*(rowSums(!is.na(res)))
#[1] 2 2 0 2 1 3
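The multiplication works because R coerces logical values to 1 (TRUE) and 0 (FALSE) in arithmetic. A small illustration with made-up vectors (is_nonzero and n_values are hypothetical names, not part of the original data):

```r
# Hypothetical values for illustration only
is_nonzero <- c(TRUE, TRUE, FALSE)  # does the row hold something other than 0?
n_values   <- c(2, 2, 1)            # raw count of entries per row

# TRUE acts as 1 and keeps the count; FALSE acts as 0 and zeroes it out
is_nonzero * n_values
# [1] 2 2 0
```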
You could use stri_count from stringi, which would be fast (see "counting occurrence of particular letter in vector of words in r"). Here, as above, you can either use arithmetic (i.e. multiplying) or use ifelse. The regex can be based on digits or on ",". If you are using ",", make sure to add 1, because a row with n commas holds n + 1 values.
ifelse(v.m.two[,1]!=0, stri_count(v.m.two[,1], regex="\\d+"), 0)
# [1] 2 2 0 2 1 3
#Or
(v.m.two[,1]!=0) *stri_count(v.m.two[,1], regex="\\d+")
#[1] 2 2 0 2 1 3
#Or
(v.m.two[,1]!=0) *(stri_count(v.m.two[,1], regex=",") +1)
#[1] 2 2 0 2 1 3
Another option to count would be to use gsub and nchar from base R.
(v.m.two[,1]!=0) *( nchar(gsub("[^,]", "", v.m.two[,1]))+1)
#[1] 2 2 0 2 1 3
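The digit-based regex can also be used without stringi. A possible base R sketch with gregexpr, regmatches, and lengths (lengths needs R 3.2.0 or later), again assuming the sample data from the question:

```r
# Sample data rebuilt from the question (assumed to be representative)
v.m.two <- data.frame(
  v.m.two = c("2633, 4868", "126, 4860", "0", "122, 4762", "4256",
              "2933, 2892, 2389"),
  stringsAsFactors = TRUE
)

x <- as.character(v.m.two[, 1])

# Extract every run of digits per row and count how many were found
n_numbers <- lengths(regmatches(x, gregexpr("[0-9]+", x)))

# Zero out the rows that hold only "0"
(x != "0") * n_numbers
# [1] 2 2 0 2 1 3
```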
v.m.two <- structure(list(v.m.two = structure(c(4L, 3L, 1L, 2L, 6L, 5L),
.Label = c("0", "122, 4762", "126, 4860", "2633, 4868", "2933, 2892, 2389",
"4256"), class = "factor")), .Names = "v.m.two", row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
Upvotes: 0