Reputation: 5893
I have a dataset that has 57 million rows and 23 columns. There is a column with species names of different birds (about 2000 unique names), and I would like to pull out two columns of data (latitude, longitude) for each unique species name, and write to file the lat/long data for each species, with the species name as the file name. This takes too long to do from R, the only language I know. What would the appropriate code be for this task?
I am attempting some pseudo-code here to demonstrate what I am guessing the code might look like, roughly:
FOR i IN 1:unique(species_name)
SELECT latitude,longitude WHERE species_name=[i]
WRITE [some code that writes a text file with species name as the file name]
LOOP END;
I imagine I can do this sort of thing in Terminal on OSX?
EDIT 20111211: Here is my workflow from R:
require(RMySQL);
require(plyr)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, user = "asdfaf", dbname = "test", host = "localhost")
splist <- read.csv("splist_use.csv")
sqlwrite <- function(spname) {
cat(spname) g1 <- dbGetQuery(con
, paste("SELECT col_16,col_18 FROM dat WHERE col_11='"
, spname, "'", sep="")
)
write.csv(g1, paste(spname, ".csv", sep=""))
rm("g1") }
l_ply(splist, sqlwrite, .progress="text" )
Upvotes: 4
Views: 3239
Reputation: 192
Have you tried using the MySQL outfile functionality?
SELECT col_16,col_18
FROM
dat
WHERE col_11= spiecesname
INTO OUTFILE '/tmp/spiecesname.csb'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
With a little bit of work you could make MySQL pick out each unique Species name, loop through the results and output to a unique csv file.
You should have an order by clause on your query
Upvotes: 1
Reputation: 35323
is an Excel file output acceptable and do you have excel? If so you can use Excel to connect to the database, and issue a your query to extract the data. it's then in .xls or csv. format. However this assumes you results are less than 1,000,000.
In excel go to data tab select from other sources select and enter preferred method of connection From here you can define a table or query to run. (if results are less than the number of rows supported on the sheet for the excel version you are on) then using the method you've selected will extract the files. Should be faster than the IO your currently doing.
Upvotes: 0
Reputation: 44240
IMHO the best thing you can do is use a scripting language (python, perl, php, shell) and generate the filenames and queries from there. It is not too difficult, but you will have to learn a different language. SQL is not suitable for imperative programming.
Upvotes: 1