Antonello
Antonello

Reputation: 6431

How to convert a DataFrame in long format (potentially sparse) to multi-dimensional Array or NamedArray

I am looking on how to write a function that convert a dataframe in long format, like dim1|dim2|dim3|value, to an Array of values or a NamedArray, using missing for when a given combination is not present, and without knowing a priori the number of columns e.g.:

using DataFrames, CSV

df = CSV.read(IOBuffer("""
reg prod var value
us banana production 10
us banana transfCoef    0.6
us banana trValues      2
us apples production    7
us apples transfCoef    0.7
us apples trValues      5
us juice production     missing
us juice transfCoef     missing
us juice trValues       missing
eu banana production    5
eu banana transfCoef    0.7
eu apples production    8
eu apples transfCoef    0.8
eu apples trValues 4
eu juice production missing
eu juice transfCoef missing
eu juice trValues missing
"""), DataFrame, delim=" ", ignorerepeated=true, copycols=true, missingstring="missing")

Note that the df above is missing the row (eu, banana, trValues).

I did found a solution for NamedArray here, but as it uses reshape() it doesn't work in situations like this.

Alternatively I can create an Array with all missing values by querying the unique values of the dimension columns and then creating the array with Array{Union{Missing,valueType},length(size)}(missing,size...) but then I don't know on how to populate it (again, I don't know a priori that the dimensions are 3 like in the example)

Upvotes: 3

Views: 106

Answers (1)

mcabbott
mcabbott

Reputation: 2580

There is a small zoo of packages which work a bit like this, and here's what I think you are asking, using one of them:

julia> using AxisKeys

julia> wrapdims(df, :value, :reg, :prod, :var)
3-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   reg ∈ 2-element Vector{String}
→   prod ∈ 3-element Vector{String}
□   var ∈ 3-element Vector{String}
And data, 2×3×3 Array{Union{Missing, Float64}, 3}:
[:, :, 1] ~ (:, :, "production"):
          ("banana")  ("apples")  ("juice")
  ("us")  10.0         7.0          missing
  ("eu")   5.0         8.0          missing

[:, :, 2] ~ (:, :, "transfCoef"):
          ("banana")  ("apples")  ("juice")
  ("us")   0.6         0.7          missing
  ("eu")   0.7         0.8          missing

[:, :, 3] ~ (:, :, "trValues"):
          ("banana")  ("apples")  ("juice")
  ("us")   2.0         5.0          missing
  ("eu")    missing    4.0          missing

The first symbol :value tells it which column's data you want in the body of the array, with each other column being used as an axis of the array. You could do something like wrapdims(df, Symbol.(names(df))...; force=true, default=missing), or better wrapdims(df, :value, Symbol.(names(df)[1:end-1])...; default=NaN) if you know :value is the last column.

Upvotes: 1

Related Questions