Reputation: 8942
I understand this is a very basic question but I don't understand what levels mean in R.
For reference, I have written a simple script that reads a CSV table, filters on one of the fields, assigns the result to a new variable and clears the memory allocated for the first variable. If I call unique() on the field I filtered on, I see that the results were indeed filtered, but there is one additional line showing 'Levels' that corresponds to data in the original dataset.
Example:
df = read.csv(path, sep=",", header=TRUE)
df_intrate = df[df$AssetClass == "ASSET CLASS A", ]
rm(df)
gc()
unique(df_intrate$AssetClass)
Results:
[1] ASSET CLASS A
Levels: ASSET CLASS E ASSET CLASS D ASSET CLASS C ASSET CLASS B ASSET CLASS A
Is the structural information from df somehow preserved in df_intrate, despite RStudio showing that df_intrate has the expected number of rows for ASSET CLASS A?
Upvotes: 15
Views: 40400
Reputation: 2135
R has a character class and a factor class. character is your basic string data structure. factor is something important for statistics: for example, you may have a data set where people are divided by the connectedness of their earlobes (an important yet commonly overlooked distinction). In such a case, each person will have a value of "connected" or "free". If you were to model, say, intelligence, as a function of earlobe connection status, you'd want that model to understand that there are two classes of people: "connected" or "free", so you'd represent that as a factor vector, and that vector would have two levels: "connected" and "free". So that's semantically why levels are a thing in R.
Syntactically, factor and character variables respond to as.integer differently. A factor variable converts to the integer code of its level, whereas a character variable is parsed more like a traditional atoi. In general, you can run into a lot of problems if you operate on a factor variable thinking it's a character.
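A minimal sketch of that difference, using made-up numeric-looking strings (the vector f and its values are illustrative, not taken from the question):

```r
# Numeric-looking strings stored as a factor (illustrative data)
f <- factor(c("10", "20", "10", "5"))

# Levels are the sorted unique strings, so "5" sorts after "20"
levels(f)                       # "10" "20" "5"

# as.integer() on a factor returns the level codes, not the numbers
codes <- as.integer(f)          # 1 2 1 3

# as.integer() on a character vector parses the text itself
parsed <- as.integer(c("10", "20", "10", "5"))   # 10 20 10 5

# To recover the numbers from a factor, go through character first
recovered <- as.integer(as.character(f))         # 10 20 10 5
```

The last line is the usual way to avoid silently getting level codes when a numeric column was read in as a factor.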
When I'm reading a csv file, in most cases, I find I'd rather have character values than factors, so I typically set read.csv(..., stringsAsFactors = FALSE). (YMMV as to whether this is your general preference. Since R 4.0.0, stringsAsFactors = FALSE is the default.)
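As a rough illustration, the two settings can be compared on a small inline CSV. The csv_text string and its column names are invented for this sketch; read.csv's text argument is used so that no file is needed:

```r
# Hypothetical CSV content, supplied inline instead of from a file
csv_text <- "id,AssetClass\n1,ASSET CLASS A\n2,ASSET CLASS B\n3,ASSET CLASS A"

# Strings converted to factors (the default before R 4.0.0)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Strings kept as plain character vectors
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$AssetClass)   # "factor"
class(as_char$AssetClass)     # "character"

# With character columns, a filtered result carries no leftover levels
only_a <- unique(as_char$AssetClass[as_char$AssetClass == "ASSET CLASS A"])
only_a                        # "ASSET CLASS A"
```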
ETA: When I'm reading a csv file, these days I'll use readr::read_csv, which defaults to character columns instead of factors. Among its many other virtues, it's much faster than read.csv.
Upvotes: 2
Reputation: 145845
Is the structural information from df somehow preserved in df_intrate, despite RStudio showing that df_intrate has the expected number of rows for ASSET CLASS A?
Yes. This is how categorical variables, called factors, are stored in R - both the levels, a vector of all possible values, and the actual values taken, are stored:
x = factor(c('a', 'b', 'c', 'a', 'b', 'b'))
x
# [1] a b c a b b
# Levels: a b c
y = x[1]
y
# [1] a
# Levels: a b c
You can get rid of unused levels with droplevels(), or by re-applying the factor function, creating a new factor out of only what is present:
droplevels(y)
# [1] a
# Levels: a
factor(y)
# [1] a
# Levels: a
You can also use droplevels on a data frame to drop all unused levels from all factor columns:
dat = data.frame(x = x)
str(dat)
# 'data.frame': 6 obs. of 1 variable:
# $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 2
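## drop = FALSE keeps the one-column subset as a data frame
str(dat[1, , drop = FALSE])
# 'data.frame': 1 obs. of 1 variable:
# $ x: Factor w/ 3 levels "a","b","c": 1
str(droplevels(dat[1, , drop = FALSE]))
# 'data.frame': 1 obs. of 1 variable:
# $ x: Factor w/ 1 level "a": 1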
Though unrelated to your current issue, we should also mention that factor has an optional levels argument which can be used to specify the levels of a factor and the order in which they should go. This can be useful if you want a specific order (perhaps for plotting or modeling), or if there are more possible levels than are actually present and you want to include them. If you don't specify the levels, the default is the sorted unique values, which for character data means alphabetical order.
x = c("agree", "disagree", "agree", "neutral", "strongly agree")
factor(x)
# [1] agree disagree agree neutral strongly agree
# Levels: agree disagree neutral strongly agree
## not a good order
factor(x, levels = c("disagree", "neutral", "agree", "strongly agree"))
# [1] agree disagree agree neutral strongly agree
# Levels: disagree neutral agree strongly agree
## better order
factor(x, levels = c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"))
# [1] agree disagree agree neutral strongly agree
# Levels: strongly disagree disagree neutral agree strongly agree
## good order, more levels than are actually present
You can use ?reorder and ?relevel (or just factor again) to change the order of levels for an already created factor.
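A short sketch of those three options on the survey-style vector from above. The score vector is a hypothetical numeric value per observation, invented only to give reorder something to sort by:

```r
f <- factor(c("agree", "disagree", "agree", "neutral", "strongly agree"))
levels(f)   # "agree" "disagree" "neutral" "strongly agree"

# relevel() moves one level to the front and keeps the rest in place
f_ref <- relevel(f, ref = "neutral")
levels(f_ref)   # "neutral" "agree" "disagree" "strongly agree"

# reorder() sorts levels by a summary (mean by default) of another variable
score <- c(4, 2, 4, 3, 5)   # hypothetical scores, one per observation
f_by_score <- reorder(f, score)
levels(f_by_score)   # "disagree" "neutral" "agree" "strongly agree"

# factor() again, with the desired order spelled out explicitly
f_manual <- factor(f, levels = c("disagree", "neutral", "agree", "strongly agree"))
levels(f_manual)   # "disagree" "neutral" "agree" "strongly agree"
```

In all three cases only the level order changes; the values stored in the factor stay the same.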
Upvotes: 7
Reputation: 37879
You see Levels in the data structure in R called factor. Factors are of integer type:
typeof(as.factor(letters))
#[1] "integer"
However, they carry labels (the levels) that map each integer code to a character string. Factors are usually helpful in models where the algorithm requires numbers (sometimes in the form of dummy variables), while the labels, which make more sense to humans, are kept for interpreting the model.
Levels are an attribute of the vector:
attributes(as.factor(letters))
#$levels
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
#[18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
#$class
#[1] "factor"
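This means that once you subset your column to only ASSET CLASS A, the attributes of the column, including the full set of levels, are carried over as well. This has nothing to do with the length of your vector, though: the result of unique() still has length 1, which is why only one value is printed after [1].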
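To tie this back to the question, here is a small sketch with a made-up data frame standing in for the CSV. The Rate column and all values are hypothetical, and the AssetClass column is built as a factor explicitly so the behaviour does not depend on the R version:

```r
# Hypothetical stand-in for the data read from the CSV
df <- data.frame(
  AssetClass = factor(c("ASSET CLASS A", "ASSET CLASS B",
                        "ASSET CLASS C", "ASSET CLASS A")),
  Rate = c(1.5, 2.0, 2.5, 1.75)
)

df_intrate <- df[df$AssetClass == "ASSET CLASS A", ]

# Only the matching rows are kept ...
n_rows <- nrow(df_intrate)                      # 2

# ... but the levels attribute still lists every original category
kept_levels <- levels(df_intrate$AssetClass)    # A, B and C

# Unused levels show up as zero counts in table()
counts <- table(df_intrate$AssetClass)

# droplevels() removes the levels that no longer occur
df_intrate <- droplevels(df_intrate)
levels(df_intrate$AssetClass)                   # "ASSET CLASS A"
```

Removing df with rm() does not change this, because the levels are stored on the column of df_intrate itself rather than looked up in the original data frame.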
Upvotes: 2