D'Arcy Mulder

Reputation: 83

How is my data stored in R?

So, I'm trying to figure out a larger problem, and I think it may stem from exactly what's happening when I import data from a .txt file. My regular beginning commands are:

data<-read.table("mydata.txt",header=T)
attach(data)

So if my data has, say, 3 columns with headers "Var1", "Var2" and "Var3", how exactly is everything imported? It seems as though it is imported as 3 separate vectors, then bound together, similar to using cbind().

My larger issue is modifying the data. If a row in my data frame has an empty spot (in any column) I need to remove it:

data <- data[complete.cases(data),]

Perfect - now say that the original data frame had 100 rows, 5 of which had an empty slot. My new data frame should have 95 rows, right? Well if I try:

> length(Var1)
[1] 100
> length(data$Var1)
[1] 95

So it seems like the original column labelled Var1 is unaffected by the line where I rewrote the entire data frame. This is why I believe that when I import the data, I really just have 3 separate columns stored somewhere called Var1, Var2 and Var3. As far as getting R to recognize that I want the modified version of the column, I think I need to do something along the lines of:

Var1 <- data$Var1 #Repeat for every variable

My issue with this is that I will need to write the above bit of code for every single variable. The data frame I have is large, and this way of coding seems tedious. Is there a better way for me to transform my data, then be able to call the modified variables, without needing to use the data$ precursor every time?

Upvotes: 2

Views: 1716

Answers (2)

Gavin Simpson

Reputation: 174778

read.table() reads the data into a data frame with a component (column) for each column (variable) in the text file. R's data frame is like an Excel spreadsheet: each column in the sheet can contain a different type of data (contrast that with a matrix, which in R can contain data of only a single type).

In effect, the result is as if the data were read in column by column and then bound together column-wise using the cbind.data.frame() method. This is not how it is done in practice, though. You have a single object, data, with three components, none of which can be accessed by typing its name alone (e.g. Var1). Try exactly this:

data <- read.table("mydata.txt", header = TRUE)
Var1

in a clean session (best if you start a new session to try this, just in case). The second line should fail with an error saying that object 'Var1' was not found.

If you were to type ls() you would see only data listed (assuming a clean session). This is clear evidence against your thinking that you have three columns stored as individual objects.
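A minimal sketch of that check, assuming a small inline table in place of mydata.txt (the contents and the name dat are made up for illustration):

```r
# Hypothetical stand-in for mydata.txt: three columns, one missing value.
dat <- read.table(text = "Var1 Var2 Var3
1 2.5 a
2 NA b
3 4.0 c", header = TRUE)

is_df     <- is.data.frame(dat)   # one object: a data frame
col_names <- names(dat)           # the columns live inside it
col_types <- sapply(dat, class)   # each column keeps its own type

print(is_df)
print(col_names)
print(col_types)
```

In a fresh session, ls() after running this should list dat (plus the helper variables above) but no Var1, Var2 or Var3.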

The real problem here is attach() not read.table().

There are very few good uses of attach() and the one you show is not among them. attach(data) places a copy of data on the search path. The key point there is copy. What is on the search path is not the same thing as data in the global environment (your workspace). Any changes to data in the global environment are not reflected in the copy on the search path, because these are two completely separate objects.

R has a search path where it looks for named objects. Normally R doesn't look inside objects, and hence Var1 etc. will not be found when you type their names at the prompt or attempt to use them directly. When you attach() an object, you can think of this as opening the object up to R's search. But the thing that catches people out is that one is now looking inside a copy of the object and not the object itself.
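The mismatch from the question can be reproduced with a small sketch, assuming a made-up three-row data frame called dat (not your actual data):

```r
# Hypothetical data: one row has a missing value.
dat <- data.frame(Var1 = c(1, NA, 3), Var2 = c(4, 5, 6))

attach(dat)                        # a snapshot of dat goes on the search path
dat <- dat[complete.cases(dat), ]  # changes only the workspace object

n_attached <- length(Var1)         # found in the attached snapshot
n_current  <- length(dat$Var1)     # taken from the modified data frame

detach(dat)                        # remove the snapshot again

print(c(attached = n_attached, current = n_current))
```

Here n_attached stays at 3 while n_current drops to 2, the same pattern as the 100 versus 95 in the question.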

In interactive sessions, there are useful helper functions that mean you don't need to be typing data$ all the time. See ?with, ?within, ?transform for example.
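As a rough illustration of those helpers, again assuming a made-up data frame dat and invented column names Total and Ratio:

```r
# Hypothetical data with one incomplete row.
dat <- data.frame(Var1 = c(1, NA, 3), Var2 = c(4, 5, 6))
dat <- dat[complete.cases(dat), ]

# with(): evaluate an expression using the columns of dat directly.
m <- with(dat, mean(Var1 + Var2))

# transform(): return a copy of dat with a new column added.
dat <- transform(dat, Total = Var1 + Var2)

# within(): similar, but written as ordinary assignments.
dat <- within(dat, Ratio <- Var1 / Var2)

print(m)
print(dat)
```

Because each call looks the columns up in dat at the time it runs, the results always reflect the current, cleaned data frame.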

Really, don't use attach() just to save a bit of typing.

Upvotes: 7

gung - Reinstate Monica

Reputation: 11893

I'm pretty sure R reads files row by row. (In fact, I think just about all programming languages work this way.) I wonder if you are attaching your data frame before removing the incomplete cases. The behavior you describe is fairly typical when people call attach(data) beforehand. In general, it is recommended that you do not use attach() at all in R. But if you must use it, call detach(data) first, then modify the data frame, and then (if you must) call attach(data) again. At that point, you will no longer have this problem.
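If you do stick with attach(), the detach, modify, re-attach sequence might look like this sketch (dat is a made-up data frame, not your file):

```r
# Hypothetical data: one row has a missing value.
dat <- data.frame(Var1 = c(1, NA, 3), Var2 = c(4, 5, 6))
attach(dat)

detach(dat)                        # 1. drop the stale snapshot
dat <- dat[complete.cases(dat), ]  # 2. modify the data frame
attach(dat)                        # 3. attach the modified version

n_after <- length(Var1)            # now reflects the cleaned data
detach(dat)                        # tidy up the search path

print(n_after)
```

After the re-attach, Var1 has 2 elements, matching the cleaned data frame.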

Note that your problem may be something different; we cannot tell from the information provided so far. You will want to provide a reproducible example so that people can help you more effectively. See the question "How to make a great R reproducible example".

Upvotes: 3
