Neurax
Neurax

Reputation: 3757

Data Frame in R use like SQL, possibly using sqldf()

I'm not at all familiar with R. I have basic competency using MATLAB which I do like a lot.

My current task involves gathering nice statistical data and analyzing it. I have scraped my data from JSON into a data frame by using fromJSON in the RJSONIO lib. Then I removed NULL values with

  Stats <- lapply(Stats, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)})

then called

DF<-do.call("rbind", Stats) to get the JSON values packed into a data frame. This left me with full atomic vectors, so this was cleaned up using

DF<-as.data.frame(DF)

Now I'm left with a DF where I'd like to be able to perform query-esque (SQL) calculations, ie.. My research has led me to a library called sqldf, but it seems to have many dependencies that I can't get running on my machine. Still looking for a solution.

List all people who are on team "NYI"

or

Find total number of goals for team

or even most helpful

Lookup data by name or other key value

Sample data frame:

_ Name Team Opponent 167 Matt Carkner NYI PHI 168 Keith Ballard MIN FLA 169 Willie Mitchell FLA MIN 170 Rob Scuderi PIT BOS 171 Nate Prosser MIN FLA 172 Nick Schultz PHI NYI

Upvotes: 1

Views: 406

Answers (2)

Neurax
Neurax

Reputation: 3757

SQLDF turned out to be the PERFECT solution to my type of question.

I simply installed the package from the R console

install.packages("sqldf")

Since I'm running on an Apple system, some Xcode dependencies needed installing, but Software Update took care of that for me autonomously.

Then I simply used the

library(sqldf)

command to begin running SQL queries on my data frames.

Use is very much so similar to standard SQL, return is new DF with requested attributes. Wrapped up in a view tag makes it a perfect tool now.

View(sqldf("SELECT * from skaters WHERE Team='PHI'"))

Upvotes: 1

KFB
KFB

Reputation: 3501

Suggest data.table package.

# toy data
# I presume your first column of data is number of "goals"
df <- structure(list(Goals = 167:172, Name = structure(c(2L, 1L, 6L, 
5L, 3L, 4L), .Label = c("Keith Ballard", "Matt Carkner", "Nate Prosser", 
"Nick Schultz", "Rob Scuderi", "Willie Mitchell"), class = "factor"), 
    Team = structure(c(3L, 2L, 1L, 5L, 2L, 4L), .Label = c("FLA", 
    "MIN", "NYI", "PHI", "PIT"), class = "factor"), Opponent = structure(c(5L, 
    2L, 3L, 1L, 2L, 4L), .Label = c("BOS", "FLA", "MIN", "NYI", 
    "PHI"), class = "factor")), .Names = c("Goals", "Name", "Team", 
"Opponent"), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L), .internal.selfref = <pointer: 0x001924a0>, index = structure(integer(0), "`__Team`" = c(3L, 
2L, 5L, 1L, 6L, 4L)))

illustrations on a few operations

setDT(df)  # convert data frame to data.table
df[Team == "NYI",]   # there's faster approach by first keying the concerned column
# you get
   Goals         Name Team Opponent
1:   167 Matt Carkner  NYI      PHI

df[,list(ttl_goals = sum(Goals)), by=Name]  # to get total goals per player
# you get
              Name ttl_goals
1:    Matt Carkner       167
2:   Keith Ballard       168
3: Willie Mitchell       169
4:     Rob Scuderi       170
5:    Nate Prosser       171
6:    Nick Schultz       172

Upvotes: 2

Related Questions