Reputation: 6431
I am looking on how to write a function that convert a dataframe in long format, like dim1|dim2|dim3|value, to an Array of values or a NamedArray, using missing
for when a given combination is not present, and without knowing a priori the number of columns e.g.:
using DataFrames, CSV
df = CSV.read(IOBuffer("""
reg prod var value
us banana production 10
us banana transfCoef 0.6
us banana trValues 2
us apples production 7
us apples transfCoef 0.7
us apples trValues 5
us juice production missing
us juice transfCoef missing
us juice trValues missing
eu banana production 5
eu banana transfCoef 0.7
eu apples production 8
eu apples transfCoef 0.8
eu apples trValues 4
eu juice production missing
eu juice transfCoef missing
eu juice trValues missing
"""), DataFrame, delim=" ", ignorerepeated=true, copycols=true, missingstring="missing")
Note that the df above is missing the row (eu, banana, trValues)
.
I did found a solution for NamedArray here, but as it uses reshape()
it doesn't work in situations like this.
Alternatively I can create an Array with all missing values by querying the unique values of the dimension columns and then creating the array with Array{Union{Missing,valueType},length(size)}(missing,size...)
but then I don't know on how to populate it (again, I don't know a priori that the dimensions are 3 like in the example)
Upvotes: 3
Views: 106
Reputation: 2580
There is a small zoo of packages which work a bit like this, and here's what I think you are asking, using one of them:
julia> using AxisKeys
julia> wrapdims(df, :value, :reg, :prod, :var)
3-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓ reg ∈ 2-element Vector{String}
→ prod ∈ 3-element Vector{String}
□ var ∈ 3-element Vector{String}
And data, 2×3×3 Array{Union{Missing, Float64}, 3}:
[:, :, 1] ~ (:, :, "production"):
("banana") ("apples") ("juice")
("us") 10.0 7.0 missing
("eu") 5.0 8.0 missing
[:, :, 2] ~ (:, :, "transfCoef"):
("banana") ("apples") ("juice")
("us") 0.6 0.7 missing
("eu") 0.7 0.8 missing
[:, :, 3] ~ (:, :, "trValues"):
("banana") ("apples") ("juice")
("us") 2.0 5.0 missing
("eu") missing 4.0 missing
The first symbol :value
tells it which column's data you want in the body of the array, with each other column being used as an axis of the array. You could do something like wrapdims(df, Symbol.(names(df))...; force=true, default=missing)
, or better wrapdims(df, :value, Symbol.(names(df)[1:end-1])...; default=NaN)
if you know :value
is the last column.
Upvotes: 1