Reputation: 35
I have annual country data for different variables which is in CSV file. The data has many countries and regions. Below is just a simple example how the data looks like for the first five rows.
region LAM LAM LAM LAM LAM LAM
country Brazil Brazil Brazil Peru Peru Peru
variable FC FP FCO FC FP FCO
1850 10 20 30 15 25 16
1851 10 20 30 15 25 16
Once I read in the CSV file as a data frame in R, I would like to transform it as below to make it easy to work with.
region country year variable amount
LAM Brazil 1850 FC 10
LAM Brazil 1851 FC 10
LAM Brazil 1850 FP 20
LAM Brazil 1850 FP 20
LAM Brazil 1850 FCO 30
LAM Brazil 1850 FCO 30
LAM Peru 1850 FC 15
Does anyone know the easiest way to do this?
Upvotes: 2
Views: 113
Reputation: 1081
library(data.table)
(df <- fread("
region LAM LAM LAM LAM LAM LAM
country Brazil Brazil Brazil Peru Peru Peru
variable FC FP FCO FC FP FCO
1850 10 20 30 15 25 16
1851 10 20 30 15 25 16", header = FALSE))
df <- setnames(transpose(df), df[, V1]) # transpose df and set col names, where the first column of df is the var names.
df <- df[-1, ] # then our df is df without the first row
df_long <- melt(df, id.vars = c("region", "country", "variable"), variable.name = "year", value.name = "amount")
df_long
region country variable year amount
1 LAM Brazil FC 1850 10
2 LAM Brazil FP 1850 20
3 LAM Brazil FCO 1850 30
4 LAM Peru FC 1850 15
5 LAM Peru FP 1850 25
6 LAM Peru FCO 1850 16
7 LAM Brazil FC 1851 10
8 LAM Brazil FP 1851 20
9 LAM Brazil FCO 1851 30
10 LAM Peru FC 1851 15
11 LAM Peru FP 1851 25
12 LAM Peru FCO 1851 16
Upvotes: 1
Reputation: 193497
I'd create a function like one of the following. I've listed the functions from fastest to slowest on a test of a csv file that, when transposed, would result in a 53 column, 100000 row dataset. One of the read_transposed
options would be faster on smaller datasets.
All of these use fread
, so make sure you have a library(data.table)
in your code.
csvtool
+ fread
csvtool
has a very fast transpose
operation that can then be quickly read using fread
.
fread_csvtool <- function(infile, header = TRUE, ...) {
fread(cmd = sprintf("cat %s | csvtool transpose - | cat", infile), header = header, ...)
}
fread
+ further processing (1)read_transposed_1 <- function(infile) {
temp <- fread(file = infile, header = FALSE)
setnames(transpose(temp[, -1]), temp[[1]])[, lapply(.SD, type.convert)]
}
fread
+ further processing (2)read_transposed_2 <- function(infile) {
temp <- lapply(fread(file = infile, sep = "\n", header = FALSE)$V1, function(x) {
scan(what = "", text = x, sep = ",", quiet = TRUE)
})
setnames(setDT(lapply(temp, function(x) type.convert(x[-1L]))),
vapply(temp, '[', character(1L), 1L))[]
}
After the data have been read, you can use your preferred method for converting the wide dataset into a long dataset. Since these all result in a data.table
, I'd suggest:
melt(the_output, id.vars = 1:3)
x <- tempfile(fileext = ".csv")
writeLines(c("region,LAM,LAM,LAM,LAM,LAM,LAM",
"country,Brazil,Brazil,Brazil,Peru,Peru,Peru",
"variable,FC,FP,FCO,FC,FP,FCO",
"1850,10,20,30,15,25,16",
"1851,10,20,30,15,25,16"), x)
fread_csvtool(x)
## region country variable 1850 1851
## 1: LAM Brazil FC 10 10
## 2: LAM Brazil FP 20 20
## 3: LAM Brazil FCO 30 30
## 4: LAM Peru FC 15 15
## 5: LAM Peru FP 25 25
## 6: LAM Peru FCO 16 16
melt(fread_csvtool(x), id.vars = 1:3, variable.name = "year", value.name = "amount")
## region country variable year amount
## 1: LAM Brazil FC 1850 10
## 2: LAM Brazil FP 1850 20
## 3: LAM Brazil FCO 1850 30
## 4: LAM Peru FC 1850 15
## 5: LAM Peru FP 1850 25
## 6: LAM Peru FCO 1850 16
## 7: LAM Brazil FC 1851 10
## 8: LAM Brazil FP 1851 20
## 9: LAM Brazil FCO 1851 30
## 10: LAM Peru FC 1851 15
## 11: LAM Peru FP 1851 25
## 12: LAM Peru FCO 1851 16
set.seed(1)
n <- 100001
years <- 50
X1 <- replicate(3, stringi::stri_rand_strings(n, 5), FALSE)
X2 <- replicate(years, runif(n), FALSE)
transposed_data <- data.table(t(cbind(as.data.table(X1), as.data.table(X2))))
transposed_data[, V1 := c(paste0("Var", 1:3), 1850:(1850+years-1))]
fwrite(transposed_data, file = "transposed.csv", col.names = FALSE)
fread_csvtool("transposed.csv")
Upvotes: 0
Reputation: 51894
as_tibble(t(df)) %>% #Transpose the df
janitor::row_to_names(1) %>% #put the first row as column names
pivot_longer(c(`1850`,`1851`),names_to = "date",values_to="value") #pivot the df to make it tidier
Something like this gives :
# A tibble: 12 x 5
region country variable date value
<chr> <chr> <chr> <chr> <chr>
1 LAM Brazil FC 1850 10
2 LAM Brazil FC 1851 10
3 LAM Brazil FP 1850 20
4 LAM Brazil FP 1851 20
5 LAM Brazil FCO 1850 30
6 LAM Brazil FCO 1851 30
7 LAM Peru FC 1850 15
8 LAM Peru FC 1851 15
9 LAM Peru FP 1850 25
10 LAM Peru FP 1851 25
11 LAM Peru FCO 1850 16
12 LAM Peru FCO 1851 16
I've loaded your table with read_csv2("your working directory",col_names=F)
.
Upvotes: 1
Reputation: 120
Lets say your data is more or less this way (it is useful, as Sotos said, to share reproducible examples):
mm <- matrix(
c("region", "LAM", "LAM", "LAM", "LAM", "LAM", "LAM",
"country", "Brazil", "Brazil", "Brazil", "Peru", "Peru", "Peru",
"variable", "FC", "FP", "FCO", "FC", "FP", "FCO",
"1850", 10, 20, 30, 15, 25, 16,
"1851", 10, 20, 30, 15, 25, 16),
byrow = TRUE,
ncol = 7)
varnames <- mm[,1] # to have variable names
mm <- t(mm) # we transpose it
mm <- as.data.frame(mm) # as data frame
colnames(mm) <- varnames # we set variable names
mm <- mm[-1, ] # this first row should not be there,
# probably this would be different as you import from a csv
library(tidyr)
mm <- pivot_longer(data = mm,
cols = c("1850", "1851"),
names_to = "year",
values_to = "amount")
I think this is what you are looking for
mm
# A tibble: 12 x 5
region country variable year amount
<fct> <fct> <fct> <chr> <fct>
1 LAM Brazil FC 1850 10
2 LAM Brazil FC 1851 10
3 LAM Brazil FP 1850 20
4 LAM Brazil FP 1851 20
5 LAM Brazil FCO 1850 30
6 LAM Brazil FCO 1851 30
7 LAM Peru FC 1850 15
8 LAM Peru FC 1851 15
9 LAM Peru FP 1850 25
10 LAM Peru FP 1851 25
11 LAM Peru FCO 1850 16
12 LAM Peru FCO 1851 16
Upvotes: 0