Reputation: 21
I have a dataframe that contains data from a questionnaire in wide format. But one column contains compressed experimental data with csv formatted data (as a character string) in long format. So i want to reverse this compression and restore a full tabular dataset. Values from the wide-format should be repeated to fill added rows.
So this would be my dataframe:
df_wide <- data.frame(v1 = c("1", "2", "3"),
v2 = c("sender, correct, \n1, y, \n2, y,",
"sender, correct, \n2, n, \n1, n,",
"sender, correct, \n1, n, \n2, y,"))
colnames(df_wide) <- c("ID", "csvcolumn")
df_wide$csvcolumn <- as.character(df_wide$csvcolumn)
> df_wide
> ID csvcolumn
>1 1 sender, correct, \n1, y, \n2, y,
>2 2 sender, correct, \n2, n, \n1, n,
>3 3 sender, correct, \n1, n, \n2, y,
and i want it to look like this:
> df_long
ID sender correct
1 1 y
1 2 y
2 2 n
2 1 n
3 1 n
3 2 y
I am able to read the csv column and save it as a new dataframe with
df2 <- read.table(text=df_wide$csvcolumn, sep =",", header=TRUE, allowEscapes = TRUE)
but obviously this doesn't include the wide-format data and i don't know which data belongs to which ID.
I hope this is somehow possible to solve and i am grateful for your help.
Upvotes: 2
Views: 83
Reputation: 347
I would question the input used here. Probably refine in Excel to an input that makes more sense to R. Not always possible however...
To clean this, first remove the unrequired header per csvcolumn
, and split on \n
characters (need tidyverse
!)
library(tidyverse)
df_n <- df_wide %>% dplyr::mutate(n = gsub("sender, correct, \n",
"",
csvcolumn)) %>%
dplyr::select(1,3) %>%
tidyr::separate(data=.,
col="n",
sep="\n",
into=c("s1","c1"))
Then use similar to your read.csv
trick, but with read_csv
df_list <- lapply(df_n, function(f){
read_csv(as.vector(f), col_names=F)
})
Then you can call rbind
to get your questionnaire data, and then ID
by total rows divided by length of ID
qdat <- do.call(rbind,df_list[2:length(df_list)]) %>%
dplyr::select(1,2) %>%
as_tibble()
df_long <- qdat %>%
dplyr::mutate(ID = rep(unlist(df_list[[1]]),
dim(qdat)[1]/dim(df_list[[1]])[1])) %>%
dplyr::select(ID, sender="X1", correct = "X2")
df_long
# A tibble: 6 x 3
ID sender correct
<dbl> <dbl> <chr>
1 1 1 y
2 2 2 n
3 3 1 n
4 1 2 y
5 2 1 n
6 3 2 y
Upvotes: 0