JamesS
JamesS

Reputation: 310

Combining several tables with some common fields into a single table

I have a series of text files (one per year) which contain the answers to an annual survey/questionnaire. Column headings in the files refer to question numbers, and each row represents one person's answers to the questions e.g.

Q1,   Q2,   Q3, ...
P1A1, P1A2, P1A3, ...
P2A1, P2A2, P2A3, ...
etc.
[where Q1 is Question 1, P1 is Person 1 and A1 is Answer 1].

The questionnaire uses the same core set of questions each year, but each year new questions are also added and some old ones are removed. My tables therefore have lots of fields in common, but they're not all the same e.g.

Year 1             |  Year 2           |  Year 3              etc.
Q1,   Q2,   Q3     |  Q1,   Q2,   Q4   |  Q1,   Q2,   Q5
P1A1, P1A2, P1A3   |  P1A1, P1A2, P1A4 |  P1A1, P1A2, P1A5
P2A1, P2A2, P2A3   |  P2A1, P2A2, P2A4 |  P2A1, P2A2, P2A5

In this example Q1 and Q2 are the core questions, while Qs 3, 4 & 5 depend on the survey year.

Removed question numbers are never re-used: if a new question is added, it's given a completely new number. The real surveys have between 300 and 500 questions, and approximately 40,000 people respond each year.

I want to combine all of this data into a single table, the column headings of which would be the set of distinct headings in the original files, plus a column for the year. For years where a question isn't relevant, I'd like to have nulls e.g.

Year,  Q1,   Q2,   Q3,   Q4,   Q5
1,     P1A1, P1A2, P1A3, Null, Null
1,     P2A1, P2A2, P2A3, Null, Null
2,     P1A1, P1A2, Null, P1A4, Null
2,     P2A1, P2A2, Null, P2A4, Null
3,     P1A1, P1A2, Null, Null, P1A5
3,     P2A1, P2A2, Null, Null, P2A5

Essentially, I just want to append columns where they match and otherwise have nulls, but if possible I'd like to generate the set of column headings for the final table automatically (from the input tables) as the number of questions in the survey is large and the prospect of working-out and then typing the Create Table statement is horrific! I suppose what I'm after is some kind of variation on UNION ALL that doesn't require me to specify all the columns in order and which can handle the non-matching columns.

Does anyone have any tips or suggestions, please? I was originally hoping to use SQLite, but I'm willing to try pretty-much anything. I'm not a databasing expert, but I have a basic understanding of SQL and can do a bit of Python scripting if that's any help.

If you've made it this far thanks for reading! I thought this would be a very simple problem, but it's surprisingly difficult to describe in detail.

Upvotes: 1

Views: 356

Answers (3)

pyInTheSky
pyInTheSky

Reputation: 1469

here is a rough sudo code sketch using only basic python stuff

have a master list for all 'entries'

  • read in the first line of your file, getting all the questions
  • create a set() and add to the set all questions from all years
  • create a dictionary with the 'keys' as the question
  • additionally add a key for person's name and the year
  • load up a new dictionary for each person and append it to your list
  • do this for each file, st you have a list of dictionaries
  • make yourself a formatting function which asks element_in_list.has_key('a question from the set of questions') ... if it does, great, if not, print "NONE" or w/e

your set() of questions should be casted to a list() to preserve order in the final output

[
  {"year":"2001","name":"bob","q1":"hello"}
  {"year":"2011","name":"rob","q5":"world"}
]

Upvotes: 1

S.Lott
S.Lott

Reputation: 391952

Step 1. Stop coding.

Step 2. Buy a book on data warehousing and star-schema design. Ralph Kimball's Data Warehouse Toolkit, for example. http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247

Step 3. Design a proper star schema. Question is a dimension. Time (specifically year) is a dimension. Respondent (if you know them) is a dimension. Response (individual survey instrument) is a dimension. Facts are answers to questions in a given year. Each survey becomes multiple rows of facts.

Step 4. Load your star schema with your various years of data. The dimensions (questions) are a little complex to load because each year reuses some questions and adds new ones. Once the dimensions are loaded, the fact rows are pretty easy to load.

Step 5. Now you can write a quick reporting application to extract the requisite data from the various fact rows and assemble the desired report.

Upvotes: 2

dfb
dfb

Reputation: 13289

First SQLite isn't going to have the fancy features of a larger DBMS. If you don't want to write a select statement fragment for each question, you'd need a pivot table, which SQLite doesn't have.

The original format is confusing - I would create a single table with columns Person, Question, Answer and Year. (Parse out the person and the question number.) and go at it from there. If this doesn't work, use the original format in a single table. From this, computing the final table is something you could easily do in Python instead of SQL

Upvotes: 1

Related Questions