Reputation: 2161
I have a survey to analyze that was completed by participants on SurveyMonkey. Unfortunately, the way the data are organized is not ideal, in that each categorical response for each question has its own column.
Here, for example, are the first few lines of one of the responses in the dataframe:
How long have you been participating in the Garden Awards Program? \
0 One year
1 NaN
2 NaN
3 NaN
4 NaN
Unnamed: 10 Unnamed: 11 Unnamed: 12 \
0 2-3 years 4-5 years 5 or more years
1 NaN NaN NaN
2 NaN 4-5 years NaN
3 2-3 years NaN NaN
4 NaN NaN 5 or more years
How did you initially learn of the Garden Awards Program? \
0 I nominated my garden to be evaluated
1 NaN
2 I nominated my garden to be evaluated
3 NaN
4 NaN
Unnamed: 14 etc...
0 A friend or family member nominated my garden ...
1 A friend or family member nominated my garden ...
2 NaN
3 NaN
4 NaN
This question, How long have you been participating in the Garden Awards Program?
, has valid responses: one year
, 2-3 years
, etc., and are all found on the first row as a key to which column holds which value. This is the first problem. (Similarly for How did you initially learn of the Garden Awards Program?
, where valid responses are: I nominated my garden to be evaluated
, A friend or family member nominated my garden
, etc.).
The second problem is that the attached columns for each categorical response all are Unnamed: N
, where N is as many columns as there are categories associated with all questions.
Before I start remapping and flattening/collapsing the columns into a single one per question, I was wondering if there was any other way of dealing with survey data presented like this using Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful.
I am guessing that I will need to flatten the columns, and thus, if anyone could suggest a method, that would be great. I'm thinking that there is a way to keep grabbing all columns belonging to a categorical response by grabbing an adjacent column until Unnamed
is no longer in the column name, but I am clueless how to do this.
Upvotes: 3
Views: 982
Reputation: 11034
I will use the following DataFrame
(which can be downloaded as CSV from here):
Q1 Unnamed: 2 Unnamed: 3 Q2 Unnamed: 5 Unnamed: 6 Q3 Unnamed: 7 Unnamed: 8
0 A1-A A1-B A1-C A2-A A2-B A2-C A3-A A4-B A3-C
1 A1-A NaN NaN NaN A2-B NaN NaN NaN A3-C
2 NaN A1-B NaN A2-A NaN NaN NaN A4-B NaN
3 NaN NaN A1-C NaN A2-B NaN A3-A NaN NaN
4 NaN A1-B NaN NaN NaN A2-C NaN NaN A3-C
5 A1-A NaN NaN NaN A2-B NaN A3-A NaN NaN
Key assumptions:
Unnamed
is actually the title of a questionSolution overview:
pd.Series
)Implementation (part 1):
indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
You can see that iterating for the over the slices like below you get a single DataFrame
corresponding to each question:
for q in slices:
print(df.iloc[:, q]) # Use `display` if using Jupyter
Implementation (part 2-3):
def parse_response(s):
try:
return s[~s.isnull()][0]
except IndexError:
return np.nan
data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions
Output:
Q1 Q2 Q3
1 A1-A A2-B A3-C
2 A1-B A2-A A4-B
3 A1-C A2-B A3-A
4 A1-B A2-C A3-C
5 A1-A A2-B A3-A
Upvotes: 4