Which format is tidy?

Question

Let's say I have a dataset of ACT test scores. Each "observation" is a student's results from taking the ACT. The ACT has five subjects: reading, English, math, science, and writing (plus a composite score). Each test subject has a scale score, a national percentile rank, and a college readiness indicator (Y or N).

My question (and it always seems to be my question, since I work a lot with assessment data) is: which format is "tidy"?

1. Each row is a unique student test + subject combination, with a subject column followed by scaleScore, percentile, and readiness columns holding the values.
2. Each row is a unique student test, with all the subjects and their respective values listed out in separate columns.
3. Something like the first option, but split into six tables, one for each subject, with a key to join on.

I've been working in SQL + Excel for a while, but I want to expand my EDA skills in R. Any help would be much appreciated! The key focus is on subsequent visualization with ggplot. I'm guessing the answer may just be "it depends" with a willingness to gather and spread for different plotting purposes.

scottr · Accepted Answer

I'd go with your first layout: the columns would be student, test, subject, scaleScore, percentile, and readiness.

Student and test variables would identify each observation.

Subject is a variable. Reading, English, math, etc. are values of the subject variable. This is essentially the heart of the tidy approach, which tends to produce long (deep) tables rather than wide ones and lends itself to joining, grouping, plotting, and so forth.
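As a rough sketch of getting from the wide layout to this long one, assuming the tidyr package and made-up column names of the form subject_measure (the data values below are invented purely for illustration):

```r
library(tidyr)

# Hypothetical wide table: one row per student test,
# one column per subject/measure pair (only two subjects shown).
wide <- data.frame(
  student            = c("A", "B"),
  test               = c("2023-04", "2023-04"),
  reading_scaleScore = c(24, 31),
  reading_percentile = c(74, 95),
  reading_readiness  = c("Y", "Y"),
  math_scaleScore    = c(19, 28),
  math_percentile    = c(48, 90),
  math_readiness     = c("N", "Y"),
  stringsAsFactors   = FALSE
)

# Split each column name at "_": the first piece becomes a value of
# `subject`; the second piece (".value") names the output column,
# so scaleScore, percentile, and readiness each keep their own type.
long <- pivot_longer(
  wide,
  cols      = -c(student, test),
  names_to  = c("subject", ".value"),
  names_sep = "_"
)

# `long` has one row per student + test + subject (4 rows here) with
# columns student, test, subject, scaleScore, percentile, readiness.
print(long)
```

With the data in this shape, subject can be mapped straight to an aesthetic or a facet in ggplot, for example ggplot(long, aes(subject, scaleScore)) + geom_boxplot().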

Or, to make it really tidy, treat score and scoreType as variables, so that each individual score value sits on its own row.
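Here is a minimal sketch of that second step, again assuming tidyr and invented data. One assumption worth flagging: only the two numeric measures are stacked, because readiness is a Y/N character column and would have to be converted before it could share a column with the numbers.

```r
library(tidyr)

# Hypothetical long table: one row per student + test + subject.
scores <- data.frame(
  student    = c("A", "A", "B", "B"),
  test       = "2023-04",
  subject    = c("reading", "math", "reading", "math"),
  scaleScore = c(24, 19, 31, 28),
  percentile = c(74, 48, 95, 90),
  readiness  = c("Y", "N", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Stack the numeric measures: the old column names become values of
# `scoreType`, and the numbers themselves go into `score`.
longer <- pivot_longer(
  scores,
  cols      = c(scaleScore, percentile),
  names_to  = "scoreType",
  values_to = "score"
)

# Going back the other way is a single call, so switching shapes
# for a particular plot is cheap.
wider_again <- pivot_wider(
  longer,
  names_from  = scoreType,
  values_from = score
)
```

The longer form suits plots that facet or color by scoreType; the previous form is handier when scaleScore and percentile go on different axes.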

Either way, keeping everything in one table means the student and test values are repeated on multiple rows. That repetition is normal from the tidy perspective. Normalized tables joined on a key are still well worth considering when you look at the bigger picture.
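For the normalized route, a small sketch assuming dplyr, with a hypothetical students table (the grade column is made up for illustration) kept separate from the scores and joined on the student key when needed:

```r
library(dplyr)

# Hypothetical student-level table: one row per student.
students <- data.frame(
  student = c("A", "B"),
  grade   = c(11, 12),
  stringsAsFactors = FALSE
)

# Hypothetical score-level table: one row per student + test + subject.
scores <- data.frame(
  student    = c("A", "A", "B", "B"),
  test       = "2023-04",
  subject    = c("reading", "math", "reading", "math"),
  scaleScore = c(24, 19, 31, 28),
  stringsAsFactors = FALSE
)

# Attach the student attributes to every score row at analysis time.
joined <- left_join(scores, students, by = "student")
```

Student attributes are stored once, and the repetition only appears in the joined result that gets handed to ggplot.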
