horcle_buzz
horcle_buzz

Reputation: 2161

SurveyMonkey data formatting using Pandas

I have a survey to analyze that was completed by participants on SurveyMonkey. Unfortunately, the way the data are organized is not ideal, in that each categorical response for each question has its own column.

Here, for example, are the first few lines of one of the responses in the dataframe:

     How long have you been participating in the Garden Awards Program?  \
0                                           One year                   
1                                                NaN                   
2                                                NaN                   
3                                                NaN                   
4                                                NaN                   

  Unnamed: 10 Unnamed: 11      Unnamed: 12  \
0   2-3 years   4-5 years  5 or more years   
1         NaN         NaN              NaN   
2         NaN   4-5 years              NaN   
3   2-3 years         NaN              NaN   
4         NaN         NaN  5 or more years   

  How did you initially learn of the Garden Awards Program?  \
0              I nominated my garden to be evaluated          
1                                                NaN          
2              I nominated my garden to be evaluated          
3                                                NaN          
4                                                NaN          

                                         Unnamed: 14  etc...
0  A friend or family member nominated my garden ...  
1  A friend or family member nominated my garden ...  
2                                                NaN  
3                                                NaN  
4                                                NaN  

This question, How long have you been participating in the Garden Awards Program?, has valid responses: one year, 2-3 years, etc., and are all found on the first row as a key to which column holds which value. This is the first problem. (Similarly for How did you initially learn of the Garden Awards Program?, where valid responses are: I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.).

The second problem is that the attached columns for each categorical response all are Unnamed: N, where N is as many columns as there are categories associated with all questions.

Before I start remapping and flattening/collapsing the columns into a single one per question, I was wondering if there was any other way of dealing with survey data presented like this using Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful.

I am guessing that I will need to flatten the columns, and thus, if anyone could suggest a method, that would be great. I'm thinking that there is a way to keep grabbing all columns belonging to a categorical response by grabbing an adjacent column until Unnamed is no longer in the column name, but I am clueless how to do this.

Upvotes: 3

Views: 982

Answers (1)

Gustavo Bezerra
Gustavo Bezerra

Reputation: 11034

I will use the following DataFrame (which can be downloaded as CSV from here):

     Q1 Unnamed: 2 Unnamed: 3    Q2 Unnamed: 5 Unnamed: 6    Q3 Unnamed: 7 Unnamed: 8
0  A1-A       A1-B       A1-C  A2-A       A2-B       A2-C  A3-A       A4-B       A3-C
1  A1-A        NaN        NaN   NaN       A2-B        NaN   NaN        NaN       A3-C
2   NaN       A1-B        NaN  A2-A        NaN        NaN   NaN       A4-B        NaN
3   NaN        NaN       A1-C   NaN       A2-B        NaN  A3-A        NaN        NaN
4   NaN       A1-B        NaN   NaN        NaN       A2-C   NaN        NaN       A3-C
5  A1-A        NaN        NaN   NaN       A2-B        NaN  A3-A        NaN        NaN

Key assumptions:

  1. Every column whose name DOES NOT start with Unnamed is actually the title of a question
  2. The columns between question titles represent options for the question on the left end of the column interval

Solution overview:

  1. Find indices of where each question starts and ends
  2. Flatten each question to a single column (pd.Series)
  3. Merge the question columns back together

Implementation (part 1):

indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

You can see that iterating for the over the slices like below you get a single DataFrame corresponding to each question:

for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter

Implementation (part 2-3):

def parse_response(s):
    try:
        return s[~s.isnull()][0]
    except IndexError:
        return np.nan

data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions

Output:

     Q1    Q2    Q3
1  A1-A  A2-B  A3-C
2  A1-B  A2-A  A4-B
3  A1-C  A2-B  A3-A
4  A1-B  A2-C  A3-C
5  A1-A  A2-B  A3-A

Upvotes: 4

Related Questions