Reputation: 81
Let's say I have a dataset of ACT test scores. Each "observation" is a student's results from taking the ACT. The ACT has five subjects: reading, English, math, science, and writing (plus a composite score). Each test subject has a scale score, a national percentile rank, and a college readiness indicator (Y or N).
My question (and it always seems to come up, since I work a lot with assessment data) is: which format is "tidy"?
A subject column and then scaleScore, percentile, and readiness columns for each value.
I've been working in SQL + Excel for a while, but I want to expand my EDA skills in R. Any help would be much appreciated! The key focus is on subsequent visualization with ggplot. I'm guessing the answer may just be "it depends" with a willingness to gather and spread for different plotting purposes.
Upvotes: 0
Views: 113
Reputation: 146
The columns would be student, test, subject, scaleScore, percentile, and readiness.
The student and test variables together identify each observation.
Subject is a variable. Reading, English, math, etc. are values of the subject variable. This is essentially the heart of the tidy approach, which tends to be deep, not wide, and lends itself to joining, grouping, plotting, and so forth.
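As a minimal sketch of that layout, here is a small data frame with one row per student, test, and subject. The values, the student ids, and the test id "sitting1" are all made up for illustration; only the column names come from the description above.

```r
# Made-up values for illustration only; "sitting1" is a hypothetical test id.
scores <- data.frame(
  student    = c("s1", "s1", "s2", "s2"),
  test       = "sitting1",
  subject    = c("reading", "math", "reading", "math"),
  scaleScore = c(24, 30, 18, 21),
  percentile = c(74, 95, 36, 58),
  readiness  = c("Y", "Y", "N", "N"),
  stringsAsFactors = FALSE
)

# Because subject is a column, grouping by it needs no reshaping first.
by_subject <- aggregate(scaleScore ~ subject, data = scores, FUN = mean)
print(by_subject)
```

With ggplot, the same subject column can be mapped straight to an axis, a colour, or a facet, for example aes(x = subject, y = scaleScore).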
Or, to make it really tidy, treat scoreType and score as variables: scoreType holds the labels scaleScore, percentile, and readiness, and score holds the corresponding value, one per row.
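A rough sketch of that longer form, using the gather and spread functions from tidyr that the question mentions. The data are made up for illustration. One assumption here: only the two numeric measures are gathered, because stacking the Y/N readiness flag into the same score column would coerce every score to character.

```r
library(tidyr)

# Made-up values for illustration only; "sitting1" is a hypothetical test id.
scores <- data.frame(
  student    = c("s1", "s1", "s2", "s2"),
  test       = "sitting1",
  subject    = c("reading", "math", "reading", "math"),
  scaleScore = c(24, 30, 18, 21),
  percentile = c(74, 95, 36, 58),
  readiness  = c("Y", "Y", "N", "N"),
  stringsAsFactors = FALSE
)

# Stack the two numeric measures into scoreType/score pairs.
# readiness stays a separate column so that score remains numeric.
long <- gather(scores, key = "scoreType", value = "score",
               scaleScore, percentile)

# spread() reverses the step when a plot needs the measures side by side.
wide <- spread(long, key = scoreType, value = score)
```

The long form suits plots that facet or colour by scoreType; the spread form suits plots that put scaleScore and percentile on separate axes. Newer tidyr versions offer pivot_longer and pivot_wider as replacements for gather and spread.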
Either way, in a single table the student and test values are repeated across multiple rows, which is normal for the tidy layout. If that repetition matters in the bigger picture, normalized tables are worth considering.
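One way the normalized option could look, sketched in base R: facts about a test sitting live in one table, per-subject results in another, and the two are joined only when a plot needs both. The sittings table and its testDate column are hypothetical, as are all the values.

```r
# Hypothetical table: one row per student and test sitting.
sittings <- data.frame(
  student  = c("s1", "s2"),
  test     = c("sitting1", "sitting1"),
  testDate = as.Date(c("2019-04-13", "2019-04-13")),
  stringsAsFactors = FALSE
)

# One row per student, test, and subject (made-up scores).
results <- data.frame(
  student    = c("s1", "s1", "s2", "s2"),
  test       = "sitting1",
  subject    = c("reading", "math", "reading", "math"),
  scaleScore = c(24, 30, 18, 21),
  stringsAsFactors = FALSE
)

# Join on the shared keys to rebuild the single tidy table on demand.
joined <- merge(results, sittings, by = c("student", "test"))
```

This mirrors a SQL join on (student, test), so sitting-level attributes are stored once rather than repeated on every subject row.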
Upvotes: 1