Julia DataFrames: Problems with Split-Apply-Combine strategy

Question

I have some data (from an R course assignment, though that doesn't matter) that I want to process with the split-apply-combine strategy, but I'm running into problems. The data is in a DataFrame called hospitals, where each row represents a hospital and each column holds one piece of information about it: name, location, rates, etc.

My objective is to find, for each State, the hospital with the lowest "Mortality by Heart Attack Rate".

I was playing around with some strategies and ran into a problem using the by function:

best_heart_rate(df) = sort(df, cols = :Mortality)[1,:]  # ascending sort, so row 1 has the lowest rate
best_hospitals = by(hospitals, :State, best_heart_rate)

The idea was to split the hospitals DataFrame by State, sort each of the resulting SubDataFrames by mortality rate, take the row with the lowest rate, and combine those rows into a new DataFrame.

But when I used this strategy, I got:

ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
 in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
 in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
 in f at none:1
 in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
 in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202

I suppose the nrow function is not implemented for SubDataFrames, which would explain the error. So I used an uglier workaround:

best_heart_rate(df) = (df[sortperm(df[:,:Mortality]), :])[1,:]  # ascending order, so row 1 has the lowest rate
best_hospitals = by(hospitals, :State, best_heart_rate)

This seems to work. But now there is an NA problem: how can I remove the rows that have NA in the Mortality column from each SubDataFrame? And is there a better strategy to accomplish my objective?
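
For comparison, here is a minimal sketch of the same split-apply-combine idea, assuming a current DataFrames.jl release (1.x) rather than the v0.3-era package shown in the stack trace. In current releases by is no longer available (groupby plus combine take its place) and NA is represented by missing. The toy data and its column values are hypothetical stand-ins for the real hospital table.

```julia
using DataFrames

# Hypothetical toy data; the real table has many more columns and rows.
hospitals = DataFrame(
    Hospital  = ["A", "B", "C", "D", "E"],
    State     = ["TX", "TX", "NY", "NY", "NY"],
    Mortality = [14.2, 11.9, missing, 15.1, 13.0],
)

# Drop rows whose Mortality is missing before grouping.
clean = dropmissing(hospitals, :Mortality)

# Split by State, pick the row with the smallest Mortality in each group,
# and combine the selected rows into one DataFrame.
# sort = true only makes the group order deterministic.
best_hospitals = combine(groupby(clean, :State; sort = true)) do sdf
    sdf[argmin(sdf.Mortality), :]
end
```

Using argmin avoids sorting each group just to read off one row. Dropping the missing values up front, rather than inside the per-group function, also means a State whose hospitals all lack a rate simply does not appear in the result.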
