prcastro

Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from an R course assignment, but that doesn't matter) that I want to process with a split-apply-combine strategy, but I'm running into problems. The data is in a DataFrame called hospitals, where each row represents a hospital and each column holds one piece of information about it, such as name, location, and rates.

My objective is to find, for each state, the hospital with the lowest "Mortality by Heart Attack Rate".

I was playing around with some strategies and ran into a problem with the by function:

best_heart_rate(df) = sort(df, cols = :Mortality)[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)

The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by mortality rate, take the row with the lowest rate, and combine those rows into a new DataFrame.

But when I used this strategy, I got:

ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
 in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
 in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
 in f at none:1
 in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
 in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202

I suppose the nrow function is not implemented for SubDataFrames, which is why sort fails. So I used an uglier workaround:

best_heart_rate(df) = (df[sortperm(df[:,:Mortality]), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)

This seems to work, but now there is an NA problem: how can I remove the rows of each SubDataFrame that have NA in the Mortality column? And is there a better strategy to accomplish my objective?


Answers (1)

Mr Alpha

I think this might work, if I've understood you correctly:

using DataFrames

# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA

# Drop the rows that contain NA with complete_cases, then use the indmax function to find the
# index of the maximum element in each group (swap in indmin if you want the lowest rate instead)
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])



    State   mortality             hospital
1   CA      0.9469632421111882    j
2   MA      0.7137144590022733    f
3   PA      0.8811901895164764    e
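
The snippet above targets the DataFrames version that shipped for Julia 0.3. On a current setup, the same split-apply-combine idea might look like the sketch below. It assumes a recent DataFrames.jl (1.x) on Julia 1.x, where missing, dropmissing, groupby/combine and argmin take the place of NA, complete_cases, by and indmin. The column names and the hard-coded rows are made up for illustration; they are not the asker's data.

```julia
using DataFrames

# Made-up data: one row per hospital, with one missing mortality value
hospitals = DataFrame(
    State     = ["CA", "CA", "CA", "MA", "MA", "PA"],
    mortality = [0.30, missing, 0.10, 0.25, 0.40, 0.15],
    hospital  = ["a", "b", "c", "d", "e", "f"],
)

# Remove the rows whose mortality is missing before grouping
clean = dropmissing(hospitals, :mortality)

# Split by State, pick the row with the lowest mortality in each group,
# and combine the selected rows into a single DataFrame
best = combine(groupby(clean, :State)) do g
    g[argmin(g.mortality), [:mortality, :hospital]]
end

println(best)
```

Dropping the missing rows before grouping keeps the per-group function simple, since argmin then only ever sees plain numbers.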
