Reputation: 845
I have a CSV file that has ~1.9M rows and 32 columns. I also have limited RAM, which makes loading it into memory very inconvenient. As a result I am thinking of using a database, but I do not have any intimate knowledge of the subject, so I have looked around on this site but found no viable solutions so far.
The CSV file looks like this:
Case,Event,P01,P02,P03,P04,P05,P06,P07,P08,P09,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30
C000039,E97553,8,10,90,-0.34176313227395744,-5.581162038780728E-4,-0.12090388100201072,-1.5172412910939355,-0.9075283173030568,2.0571877671625742,-0.002902632819930783,-0.6761896565590585,-0.7258602353522214,0.8684602429202587,0.0023189312896576167,0.002318939470525324,-0.1881462494296103,-0.0014303471592995315,-0.03133299206977217,7.72338072867324E-4,-0.08952068388668191,-1.4536398437657685,-0.020065144945600275,-0.16276139919188118,0.6915962670997067,-1.593412697264055,-1.563877781707804,-1.4921751129092755,4.701551108078644,6,-0.688302560842075
C000039,E23039,8,10,90,-0.3420173545012358,-5.581162038780728E-4,-1.6563770995734233,-1.5386562526752448,-1.3604342580422861,2.1025445031625525,-0.0028504751366762804,-0.6103972392687121,-2.0390388918403284,-1.7249948885013526,0.00231891181914203,0.0023189141684282384,-0.18603688853814693,-0.0014303471592995315,-0.03182759137355937,0.001011754948131039,0.13009444290656555,-1.737249614361576,-0.015763602969926262,-0.16276139919188118,0.7133868949811379,-1.624962995908364,-1.5946762525901037,-1.5362787555380522,4.751479927607516,6,-0.688302560842075
C000039,E23039,35,10,90,-0.3593468363273839,-5.581162038780728E-4,-2.2590624066428937,-1.540784192984501,-1.3651511418164592,0.05539868728273849,-0.00225912499740972,0.20899232681704485,-2.2007336302050633,-2.518401278903022,0.0023189850665203673,0.0023189834133465186,-0.1386548782028836,-0.0013092574968056093,-0.0315006293688149,9.042390365542781E-4,-0.3514180333671346,-1.8007561969675518,-0.008593259125791147,-2.295351187387221,0.6329101442826701,-1.8095530459660578,-1.7748676145152822,-1.495347406256394,2.553693742122162,34,-0.6882806822066699
.... .... up to 1.9M rows
As you can see, the 'Case' column repeats itself, but I want to get only the unique records before importing it into a data frame. So I used this:
f<-file("test.csv")
bigdf <- sqldf("select * from 'f' where Case in (select Case from 'f' group by Case having count(*) = 1)", dbname = tempfile(), file.format = list(header = T, row.names = F))
However I get this error:
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: near "in": syntax error)
Is there something obvious I am missing here? Many thanks in advance.
Upvotes: 1
Views: 5876
Reputation: 14054
For those who want unique rows using sqldf, use DISTINCT:
newdf <- sqldf("SELECT DISTINCT * FROM df") # Get unique rows
sqldf uses SQLite syntax by default.
newdf <- sqldf("SELECT DISTINCT name FROM df") # Get unique column values
newdf <- sqldf("SELECT *, COUNT(DISTINCT name) as NoNames FROM df GROUP BY whatever") # Get a count of unique names
Upvotes: 0
Reputation: 1
if you use "Case" in sqldf in R, you should put a "," before "Case". Because the "Case" query is the whole line, you should make it seperate.
Upvotes: -1