Reputation: 727
I have a file that I read into R as a data frame (called CA1) with the following structure:
Station_ID Guage_Type Lat Long Date Time_Zone Time_Frame H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12 H13 H14 H15 H16 H17 H18 H19 H20 H21 H22 H23
1 4457700 HI 41.52 124.03 19480701 8 LST 0 0 0 0 0 0 0 0 0 0 0 0 MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS
2 4457700 HI 41.52 124.03 19480705 8 LST 0 1 1 1 1 1 2 2 2 4 5 5 4 7 1 1 0 0 10 13 5 1 1 3
3 4457700 HI 41.52 124.03 19480706 8 LST 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4457700 HI 41.52 124.03 19480727 8 LST 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 4457700 HI 41.52 124.03 19480801 8 LST 0 0 0 0 0 0 0 0 0 0 0 0 MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS
6 4457700 HI 41.52 124.03 19480817 8 LST 0 0 0 0 0 0 ACC ACC ACC ACC ACC ACC 6 1 0 0 0 0 0 0 0 0 0 0
H0 through H23 are read in as character() because some values are not numeric and instead hold codes such as MIS, ACC, or DEL.
My question: is there a way to convert each of the columns H0 through H23 to numeric, with the character values (MIS, ACC, DEL) becoming NA or NaN, which I can then test for (with is.na or is.nan) so that I can run some numeric models on the data? Or would it be better to replace the character values with a sentinel value, such as -9999?
I have tried many approaches, including a few I found on this site, but none of them work. For example:
for (i in 8:31)
{
CA1[6,i] <- as.numeric(as.character(CA1[6,i]))
}
which of course gives warnings, but when I test whether two specific values (CA1[6,8] and CA1[6,19]) are numeric with is.numeric(), I get FALSE for both. I don't understand why for the first, but I do for the second, since it is a "". However, I can test that one with is.na(CA1[6,19]), which returns TRUE; that is fine for me, since it tells me the value is not numeric.
A second way I tried is:
for (i in 8:31)
{
CA1[6,i] <- as.numeric(levels(CA1[6,i]))[CA1[6,i]]
}
which gives me the same results as before.
Is there a way of doing what I am trying to do in an efficient manner? Your help is greatly appreciated. Thank you
Upvotes: 6
Views: 11532
Reputation: 102076
The immediate problem is that each column of a data frame can only contain values of one type. The 6 in CA1[6,i] in your code means that only a single value is being converted in each column, so, when it is inserted after conversion, it has to be coerced back to a string to match the rest of the column.
You can solve this by converting the whole column in one go, so that the column is entirely replaced. That is, remove the 6:
for (i in 8:31)
{
CA1[,i] <- as.numeric(as.character(CA1[,i]))
}
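The same whole-column conversion can also be written without a loop. This is only a minimal sketch on a made-up data frame (toy, with three hour columns standing in for H0 through H23); it selects the columns by name and suppresses the expected coercion warnings:

```r
# Toy data standing in for CA1 (hypothetical values, only three hour columns)
toy <- data.frame(
  Station_ID = c(4457700, 4457700),
  H0 = c("0", "ACC"),
  H1 = c("1", "6"),
  H2 = c("MIS", "0"),
  stringsAsFactors = FALSE
)

# Positions of the hour columns, matched by name rather than hard-coded as 8:31
hour_cols <- grep("^H[0-9]+$", names(toy))

# Replace each whole column; codes such as "MIS" or "ACC" cannot be parsed and become NA.
# as.character() keeps this safe if the columns were read in as factors.
toy[hour_cols] <- lapply(toy[hour_cols], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})

str(toy)       # H0, H1, H2 are now numeric
is.na(toy$H0)  # TRUE where the original value was a code
```

Using NA rather than a sentinel such as -9999 means functions like mean() can skip the gaps with na.rm = TRUE instead of silently treating them as data.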
Upvotes: 6
Reputation: 19454
Following on Tommy's answer, you could potentially deal with this issue when reading in the data. If "MIS", "ACC" and "DEL" always denote missing values, you could use the na.strings argument in read.table.
read.table('foo.txt', header=TRUE, na.strings = c("MIS", "ACC", "DEL"))
If there are other character strings that always denote missing values, then you could add them to the above vector.
However, if, for example, "MIS" appears in the column Time_Frame and it has a meaning other than to denote a missing value, then DO NOT TAKE THIS APPROACH!!
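To see the effect without a file on disk, here is a small sketch that feeds two made-up rows to read.table through its text argument. The sample rows and the name CA_demo are assumptions for illustration, trimmed to three hour columns:

```r
# Two sample rows in the question's layout, trimmed to three hour columns (hypothetical)
txt <- "Station_ID Time_Frame H0 H1 H2
4457700 LST 0 MIS 1
4457700 LST ACC 6 0"

# na.strings turns the listed codes into NA, so the H columns are parsed as numbers
CA_demo <- read.table(text = txt, header = TRUE,
                      na.strings = c("MIS", "ACC", "DEL"))

sapply(CA_demo, class)   # class of each column after reading
colSums(is.na(CA_demo))  # number of missing values per column
```

Note that na.strings applies to every column, which is why the caveat about Time_Frame above matters.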
Upvotes: 2
Reputation: 40821
When you read in the data, you can typically specify what the column types are. For example, read.table / read.csv have a colClasses argument.
# Something like this
read.table('foo.txt', header=TRUE, colClasses=c('integer', 'factor', 'numeric', 'numeric', 'Date'))
See ?read.table for more information.
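As a hedged sketch of how this might look for data shaped like the question's, the example below combines colClasses with na.strings, since a column declared numeric will fail to read if it still contains a code such as "MIS". The sample rows and the name CA_typed are made up. The date is read as character and converted afterwards, on the assumption that values like 19480701 are in %Y%m%d form, which the 'Date' class does not parse by default:

```r
# Made-up rows in the question's layout, trimmed to two hour columns
txt <- "Station_ID Guage_Type Date H0 H1
4457700 HI 19480701 0 MIS
4457700 HI 19480817 ACC 6"

CA_typed <- read.table(
  text = txt, header = TRUE,
  # one class per column, in order; Date is read as text and converted below
  colClasses = c("integer", "character", "character", "numeric", "numeric"),
  # needed so the codes are read as NA instead of breaking the numeric columns
  na.strings = c("MIS", "ACC", "DEL")
)

# Convert the yyyymmdd strings to Date objects (format is an assumption about the data)
CA_typed$Date <- as.Date(CA_typed$Date, format = "%Y%m%d")

str(CA_typed)
```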
Upvotes: 6