cantdutchthis

Reputation: 34537

julia create an empty dataframe and append rows to it

I am trying out the Julia DataFrames module. I am interested in it so I can use it to plot simple simulations in Gadfly. I want to be able to iteratively add rows to the dataframe and I want to initialize it as empty.

Tutorials and documentation on how to do this are sparse (most documentation describes how to analyse imported data).

To append to a nonempty dataframe is straightforward:

df = DataFrame(A = [1, 2], B = [4, 5])
push!(df, [3 6])

This returns:

3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 1 | 4 |
| 2   | 2 | 5 |
| 3   | 3 | 6 |

But when I initialize it as empty, I get an error:

df = DataFrame(A = [], B = [])
push!(df, [3, 6])

Error message:

ArgumentError("Error adding 3 to column :A. Possible type mis-match.")
while loading In[220], in expression starting on line 2

What is the best way to initialize an empty Julia DataFrame such that you can iteratively add items to it later in a for loop?

Upvotes: 49

Views: 26148

Answers (4)

phntm

Reputation: 541

I think that, at least in the latest version of Julia, you can achieve this by passing Pair objects (column name => empty vector) without specifying an element type. The resulting columns have element type Any:

df = DataFrame("A" => [], "B" => [])
push!(df, [5,'f'])

1×2 DataFrame
 Row │ A    B   
     │ Any  Any 
─────┼──────────
   1 │ 5    f

As seen in this post by @Bogumił Kamiński, where multiple columns are needed, something like this can be done:

entries = ["A", "B", "C", "D"]
df = DataFrame([name => [] for name in entries])
push!(df, [4, 5, 'r', 'p'])

1×4 DataFrame
 Row │ A    B    C    D   
     │ Any  Any  Any  Any 
─────┼────────────────────
   1 │ 4    5    r    p

Or, as pointed out by @Antonello below, if you know the type you can do:

df = DataFrame([name => Int[] for name in entries])

which is also in @Bogumił Kamiński's original post.
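
To illustrate the difference between the two variants above, here is a minimal sketch, assuming a recent DataFrames.jl release that accepts string column names. The variable names loose and typed are only illustrative.

```julia
using DataFrames

# Untyped empty columns accept any value, but their element type is Any.
loose = DataFrame("A" => [], "B" => [])
push!(loose, [5, 'f'])

# Typed empty columns keep a concrete element type for each column.
typed = DataFrame("A" => Int[], "B" => Char[])
push!(typed, (5, 'f'))

println(eltype(loose.A))  # element type of the untyped column
println(eltype(typed.A))  # element type of the typed column
```

Typed columns are usually the better choice when the types are known up front, since later operations on the column do not have to deal with Any.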

Upvotes: 2

wueli

Reputation: 1209

The answer from @waTeim already answers the initial question. But what if I want to dynamically create an empty DataFrame and append rows to it? For example, what if I don't want hard-coded column names?

In this case, df = DataFrame(A = Int64[], B = Int64[]) is not sufficient. The NamedTuple (A = Int64[], B = Int64[]) needs to be created dynamically.

Let's assume we have a vector of column names col_names and a vector of column types col_types from which to create an empty DataFrame.

col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, type[] for type in col_types )...)

df = DataFrame(named_tuple) # 0×2 DataFrame

Alternatively, the NamedTuple could be created with

# or by doing
named_tuple = NamedTuple{Tuple(col_names)}(type[] for type in col_types )
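
As a usage sketch, the construction above can be wrapped in a small helper and then filled in a loop. The helper name empty_frame is hypothetical and not part of DataFrames; the generator is converted to a Tuple explicitly here, which is an assumption made to keep the call unambiguous.

```julia
using DataFrames

col_names = [:A, :B]
col_types = [Int64, Float64]

# Hypothetical helper: build a NamedTuple of typed empty vectors, then a DataFrame.
empty_frame(names, types) =
    DataFrame(NamedTuple{Tuple(names)}(Tuple(T[] for T in types)))

df = empty_frame(col_names, col_types)

# Append rows one at a time; a Tuple is matched to the columns by position.
for i in 1:3
    push!(df, (i, i / 2))
end

println(df)
```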

Upvotes: 5

The Unfun Cat

Reputation: 31918

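An alternative is to preallocate a DataFrame with the same columns as an existing one using similar, and then fill it row by row:

using Pkg, CSV, DataFrames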

iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"))

new_iris = similar(iris, nrow(iris))

head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1   │ missing     │ missing    │ missing     │ missing    │ missing │
# │ 2   │ missing     │ missing    │ missing     │ missing    │ missing │

for (i, row) in enumerate(eachrow(iris))
    new_iris[i, :] = row[:]
end

head(new_iris, 2)

# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
# │ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │

Upvotes: 4

waTeim

Reputation: 9235

A zero length array defined using only [] will lack sufficient type information.

julia> typeof([])
Array{None,1}

The way to avoid that problem is simply to indicate the element type.

julia> typeof(Int64[])
Array{Int64,1}

And you can apply that to your DataFrame problem:

julia> df = DataFrame(A = Int64[], B = Int64[])
0x2 DataFrame

julia> push!(df, [3  6])

julia> df
1x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 3 | 6 |
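
Since the question asks about appending in a for loop, here is a minimal sketch of that pattern built on the typed initialization above. The column names t and x and the values pushed are placeholders for a simulation's output.

```julia
using DataFrames

# Empty DataFrame with typed columns (names chosen for illustration only).
df = DataFrame(t = Int[], x = Float64[])

# Append one row per iteration; a NamedTuple matches values to columns by name.
for i in 1:5
    push!(df, (t = i, x = i^2 / 2))
end

println(df)
```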

Upvotes: 49
