Betty

Reputation: 31

R programming: cleaning data

I have a question about R programming.

If I have a dataset like the following:

LA NY MA
1 2 3
4 5 6
3 5
4

(In other words, the columns do not all have the same number of values.) I am trying to use lm to perform an ANOVA test (to decide whether the mean is the same in each state), but it keeps showing "an error occurred" because the rows do not match. How can I fix this? Also, I usually call lm as lm(y~x), but if I want to do lm(y~LA), there is no y variable to type in. Should I create a new column/row for this?

Upvotes: 0

Views: 245

Answers (2)

Dr Nisha Arora

Reputation: 738

You can use gather() from the tidyr package to reshape the data into long format for analysis. It takes multiple columns and gathers them into key-value pairs, making “wide” data longer.

Sample code:

# values from the question; NA pads the shorter columns
LA <- c(1, 4, 3, 4)
NY <- c(2, 5, 5, NA)
MA <- c(3, 6, NA, NA)
df <- data.frame(LA, NY, MA) # data in wide format

library(tidyr)
df <- df %>% gather(attribute, value) # data in long format
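Once the data are in long format, the model from the question can be fitted, since there is now a single response column to use as y. The following is a minimal sketch, not a definitive analysis: it rebuilds the data frame from the numbers in the question so that it runs on its own, reuses the attribute and value column names from the gather() call above, and adds na.rm = TRUE (an assumption on my part) to drop the NA padding before fitting.

```r
library(tidyr)

# Wide data from the question; NA pads the shorter columns
df <- data.frame(LA = c(1, 4, 3, 4),
                 NY = c(2, 5, 5, NA),
                 MA = c(3, 6, NA, NA))

# Long format: one row per observation, NA padding dropped
long <- gather(df, attribute, value, na.rm = TRUE)

# One-way ANOVA: value is the response (y), attribute is the group (x)
fit <- lm(value ~ attribute, data = long)
anova(fit)
```

Here lm() treats the character column attribute as a factor, so anova(fit) reports a single F test for whether the state means differ.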

Upvotes: 0

Rich Scriven

Reputation: 99371

Maybe you could do something like this. To read the data, use the fill argument in read.table, which pads the short rows with NA. In place of text = txt, you would pass your file name.

(dat <- read.table(text = txt, header = TRUE, fill = TRUE))
#   LA NY MA
# 1  1  2  3
# 2  4  5  6
# 3  3  5 NA
# 4  4 NA NA

Then we can take the column means and create a new two-column data frame.

cm <- colMeans(dat, na.rm = TRUE)
data.frame(state = names(cm), mean = unname(cm))
#   state mean
# 1    LA  3.0
# 2    NY  4.0
# 3    MA  4.5

where txt is

txt <- "LA NY MA
1 2 3
4 5 6
3 5
4"
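If the goal is the ANOVA itself rather than the table of means, the data read with fill = TRUE can also be reshaped in base R. This is only a sketch under a few assumptions: it uses stack() and aov(), neither of which appears in the answer above, and the column names value and state are chosen here for illustration.

```r
txt <- "LA NY MA
1 2 3
4 5 6
3 5
4"
dat <- read.table(text = txt, header = TRUE, fill = TRUE)

# stack() puts all numbers in `values` and the source column name in `ind`;
# na.omit() then drops the rows created by the NA padding
long <- na.omit(stack(dat))
names(long) <- c("value", "state")

# One-way ANOVA of value by state
fit <- aov(value ~ state, data = long)
summary(fit)
```

The stacked value column plays the role of y in lm(y ~ x), and state plays the role of x, so the same formula would also work with lm().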

Upvotes: 1
