Reputation: 1

R: grouping a dataframe of sentences and identifying which columns share similar substrings.

I need some assistance organizing a bunch of survey questions that came in the form of a csv. The csv is of this format. Say that survey question 1 has multiple answers to choose from, like (a,b,c,d). In the csv it would look like this.

respondent Q1 Q1 Q1 Q1
1           a  
2              b    
3                 c
4                    d
5              b

The output of Q1 is spread over different columns for each possible answer. I would like to convert this to:

 respondent Q1 
 1           a  
 2           b    
 3           c
 4           d
 5           b

My problem is not with stacking it together, I mean I haven't figure that out yet, but I think I can find the resources for doing that.

My problem is that not all survey questions follow this format. So Q2 may take up 6 columns, Q3 just one column, etc.

One possible solution and the one that I would have to resort to if I have nothing better is to hardcode and count how many columns belong to each question and go through the csv, subset for those questions and organize them into single column.

This is a feasible solution, but I would prefer not to when I got like 20 something questions to look at. Therefore I was wondering if there's a way to cluster/group the questions together.

The column names for Q1 is not exactly the same, but very similar. For example:

Q1. How do you feel today? -Good.

Q1. How do you feel today? -Bad.

Basically, I want to facilitate the task of having to figure out how many columns belong to Q1, how many to Q2. Is it possible to use the similarity of column headers belonging to the same question and say these 5 columns belong to the same question.

Thanks!

Upvotes: 0

Answers (2)

Parfait

Reputation: 107577

Consider base function's apply():

df <- data.frame(df$respondent,
                 Q1=(apply(df, 1, max)))

#   respondent Q1
# 1          1  a
# 2          2  b
# 3          3  c
# 4          4  d
# 5          5  b

For multiple columns, consider binding apply functions using grep for columns of similar names which answers your above question:

Is it possible to use the similarity of column headers belonging to the same question?

df <- data.frame(df$respondent,
                 Q1=apply(df[,grep('Q1', names(df))], 1, max),
                 Q2=apply(df[,grep('Q2', names(df))], 1, max),
                 Q3=df$Q3,                                      # ONLY ONE COLUMN (NO APPLY)
                 Q4=apply(df[,grep('Q4', names(df))], 1, max))

Upvotes: 0

chinsoon12

Reputation: 25225

assuming that your survey results are all in a single file like this temp.csv:

respondent,Q1,Q1,Q1,Q1,Q2,Q2,Q3,Q4,Q4,Q4
1,a,,,,a,,a,,b,
2,,b,,,,b,a,a,,
3,,,c,,a,,a,,,c
4,,,,d,,b,a,,b,
5,,b,,,,b,a,,b,

this code below should get you what you want

library(dplyr)
survey <- read.csv("temp.csv",header=F,row.names=1,check.names=F)
group_by(data.frame(t(survey), check.names=F), respondent) %>%
    summarise_each(funs(.[.!=""]))

##results
##  respondent 1 2 3 4 5
##1         Q1 a b c d b
##2         Q2 a b a b b
##3         Q3 a a a a a
##4         Q4 b a c b b

#to get the orientation as what you want
write.table(t(ans), "answers.csv", sep=",", row.names=T, col.names=F)
read.csv("answers.csv")

If each survey question is in a different file, you can use lapply to read in all "temp(QN#).csv" into a list then and merge these files recursively using row.names.

Upvotes: 1

R: grouping a dataframe of sentences and identifying which columns share similar substrings.

Answers (2)

Related Questions