Reputation: 19707
Our underwriting company just sent us a data extract of leads. There are 9 million rows. The rows consist of LeadID (guid), RawLeadXML (xml -- probably 3-4kb max), and a LeadStatusID (int).
I first tried to add an autonumber integer and make it the primary key of this table. Well, it added the field, but couldn't make it the primary key (There is insufficient memory available in the buffer pool.)
What I need to do is to take every record, 1 by 1, and get the XML, put it into an XmlDocument object in .Net, strip out all the fields I want (firstname, lastname, etc) and store that parsed information into another table.
I can't even get this statement to run: select * from Leads where id between 1 and 1000
If I just select 1000 records at a time (select top 1000 * from Leads), that works, but how would I then select the next 1000 records without some sort of reference point?
My machine has 4 processors (2.53 GHz) and 12 GB of RAM. It's not a server, but it's a beefy machine. I don't know what to try next, honestly.
EDIT: I left out that the original file was actually an MDF (and the associated LDF) file, so I just attached to them in SQL Server.
EDIT 2: I messed up and said that the RawLeadXML column was XML -- it isn't, it's just nvarchar(max). I honestly didn't know there WAS an xml datatype.
EDIT 3: I can't even issue a delete statement on this table: "delete from leads where leadid = '100a7927-5311-4f12-8fe3-95c079d32dd4' " blows up with:
Msg 802, Level 17, State 20, Line 2
There is insufficient memory available in the buffer pool.
I have no idea what to do next. How in the hell is this even a problem? There are thousands of databases in the world with more records than I've got.
EDIT 4: In case anyone cares, none of the below solutions worked. I'm sure this is a limitation of my machine, and definitely not a condemnation of the fine answers I received below. Currently, I am transferring the zipped DB (2.6 GB) to our servers in Rackspace, and then I will attempt to add an index on that hardware, hopefully without taking down our production servers. Once the index is added, I'm hopeful that I can zip up the db and bring it back down to my local machine, and then be able to actually do something with it.
EDIT 5: My machine was literally incapable of handling a table of this size. My machine has 12 GB of RAM, 64-bit Windows 7 Professional, a quad-core 2.53 GHz processor, an SSD, etc. It's pretty beefy for a development machine. And it couldn't handle this.
So, I moved the DB to our server in Rackspace in London. 48 GB of memory in that one, and it was able to add the index I needed. Even after that, my machine here was unable to do anything all that useful with it, so I've written a .Net program that is running in London to pull the records out 1000 at a time, parse them into another table, and then mark the original records as processed.
Once I do this, I'm going to have to leave the DB in London, because I doubt I could write any meaningful reports against this monstrosity locally. Which will make development fun.
Summary: I believe that there is no good way to process a dataset this large without using server-class hardware with at least 48 GB of RAM (in my case).
Upvotes: 8
Views: 8271
Reputation: 2354
How about just extracting the data to a text file straight from the Access DB? Then you can import it into the RDBMS of your choice with a bulk insert.
Getting to Access DBs via link tables in MSSQL is not ideal. Here we use a read-only, forward-only recordset cursor to walk through the rows one at a time.
This is VBScript, so it won't be the fastest, but it should work. You'll have to tailor the XML extraction routine for your schema.
' Code run against an Access DB with a RawLeadXML column
Dim connection
Set connection = CreateObject("ADODB.Connection")
connection.Open "Driver={Microsoft Access Driver (*.mdb)}; DBQ=d:\path\to\file.mdb;"

Dim recs
' By default this is a read-only, forward-only cursor
Set recs = connection.Execute("SELECT * FROM Leads")

Dim fso
Set fso = CreateObject("Scripting.FileSystemObject")

Dim out
Set out = fso.OpenTextFile("d:\path\to\output.txt", 2, True)

Dim id
id = 0

While Not recs.EOF
    id = id + 1
    out.Write CStr(id)   ' write an ID counter
    out.Write ","
    ExtractFieldsFromXML recs.Fields("RawLeadXML").Value, out
    out.Write Chr(10)    ' linefeed
    recs.MoveNext
Wend

out.Close
recs.Close
connection.Close

' Extract data from the XML and write it to the stream,
' separated by commas
Sub ExtractFieldsFromXML(xml, out)
    Dim doc
    Set doc = CreateObject("MSXML2.DOMDocument")
    doc.loadXML xml
    out.Write doc.selectSingleNode("/xml/firstname").text
    out.Write ","
    out.Write doc.selectSingleNode("/xml/lastname").text
End Sub
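On the SQL Server side, loading the resulting file can then be the plain bulk insert mentioned above. A minimal sketch, assuming a staging table (LeadStaging, a name invented here) matching the id, firstname, lastname layout the script writes, and a reasonably recent SQL Server for the hex row-terminator syntax:

-- Staging table matching the columns the VBScript writes out
CREATE TABLE LeadStaging (
    Id        INT,
    FirstName NVARCHAR(100),
    LastName  NVARCHAR(100)
);

-- Bulk load the comma-separated file produced above
BULK INSERT LeadStaging
FROM 'd:\path\to\output.txt'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '0x0a'   -- the script terminates rows with a bare linefeed
);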
Upvotes: 0
Reputation: 294207
Error 802 does not mean out of memory in the classical sense of insufficient memory to do an allocation (that would trigger error 701). Error 802 actually indicates that the buffer pool cannot grow, which can happen for a couple of different reasons.
If you still haven't found the problem, read How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005 (the article applies equally well to SQL Server 2008 and to 2008 R2) and follow the guidelines there to understand who/what is consuming your memory.
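If you just want a quick read on where the memory is going before working through the article, something along these lines is a starting point (the clerk column names below are the SQL Server 2005/2008-era ones; later versions consolidate them into pages_kb):

-- Full memory report, as described in the linked article
DBCC MEMORYSTATUS;

-- Quick summary: which memory clerks are holding the most memory
SELECT TOP (10)
       type,
       SUM(single_pages_kb + multi_pages_kb) / 1024 AS memory_mb
FROM sys.dm_os_memory_clerks
GROUP BY type
ORDER BY memory_mb DESC;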
Upvotes: 9
Reputation: 3271
Adding a column to the existing table is not an option, since adding an auto-id didn't work either. You might want to store the altered/cleaned data in a new table instead. That table can have an auto-id and separate columns for the data you're extracting from the XML.
After inserting data into this table, you can delete the original rows from the source table. Or write a select statement that excludes rows whose GUID is already in the new table.
Since the rows of your source table have no dependency or specific order, the order in which you select / process your rows is not important.
Based on the comment, a suggestion for a SQL statement:
WHILE EXISTS(SELECT * FROM [source] [s]
             WHERE NOT EXISTS(SELECT * FROM [destination] [d] WHERE [d].[leadId] = [s].[leadId]))
BEGIN
    INSERT INTO [destination] ([leadId], [RawLeadXML], [LeadStatusId])
    SELECT TOP 100 [s].[leadId], [s].[RawLeadXML], [s].[LeadStatusId]
    FROM [source] [s]
    WHERE NOT EXISTS(SELECT * FROM [destination] [d] WHERE [d].[leadId] = [s].[leadId])
END
I've set the number of records to insert per batch at 100. That should keep the memory usage in check.
Upvotes: 3
Reputation: 3436
What about
Upvotes: 1
Reputation: 1496
9 million rows is not that big; you probably just don't have an index on the LeadId column. Create one first, though it will take some time (it doesn't have to be unique or a primary key). Then use "SELECT TOP 1000 LeadId, RawLeadXML FROM Leads ORDER BY LeadId" for the first query. Record the last LeadId value (MaxLeadId), then use "SELECT TOP 1000 LeadId, RawLeadXML FROM Leads WHERE LeadId > MaxLeadId ORDER BY LeadId", and so on...
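A minimal sketch of that keyset-paging loop in T-SQL, assuming the table and column names from the question; the all-zeros GUID starting value and the inline DECLARE initialization (SQL Server 2008+) are assumptions:

-- One-time: an index so the range seek and ORDER BY are cheap
CREATE NONCLUSTERED INDEX IX_Leads_LeadID ON Leads (LeadID);

DECLARE @LastLeadID UNIQUEIDENTIFIER = '00000000-0000-0000-0000-000000000000';
DECLARE @batch TABLE (LeadID UNIQUEIDENTIFIER PRIMARY KEY, RawLeadXML NVARCHAR(MAX));

WHILE 1 = 1
BEGIN
    DELETE FROM @batch;

    -- Next 1000 rows after the last key we saw
    INSERT INTO @batch (LeadID, RawLeadXML)
    SELECT TOP (1000) LeadID, RawLeadXML
    FROM Leads
    WHERE LeadID > @LastLeadID
    ORDER BY LeadID;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to read

    -- ... hand the rows in @batch to the parsing step here ...

    SELECT @LastLeadID = MAX(LeadID) FROM @batch;   -- remember where we stopped
END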
Upvotes: 4
Reputation: 4941
Look into doing this with ROW_NUMBER() and ranking functions.
Take a look at this thread. It has the basic concepts to get you started.
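Roughly, the idea looks like this; the window boundaries and the ORDER BY on LeadID are just illustrative assumptions:

-- Number every row, then pull an arbitrary 1000-row window by row number
WITH NumberedLeads AS (
    SELECT LeadID,
           RawLeadXML,
           ROW_NUMBER() OVER (ORDER BY LeadID) AS rn
    FROM Leads
)
SELECT LeadID, RawLeadXML
FROM NumberedLeads
WHERE rn BETWEEN 1001 AND 2000;   -- the second batch of 1000

Without an index on LeadID the numbering still forces a scan and sort of the whole table for each window, so this pairs best with adding that index first.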
Upvotes: 1
Reputation: 3208
What format are you getting the extract in originally? If it's a CSV or something similar, and you don't care about the other information in that xml fragment, I'd use SSIS and skip this intermediate table entirely. The native XML type has some overhead associated with it that you don't need to incur if you're just looking to shred some xml once and then discard it.
Otherwise, you should be able to use XML path statements directly in a SQL query to get the data from one table to another. If you're trying to add a new PK from the designer, it may well fail on a large table. You will need to script the changes and run them manually, and possibly tweak them to make things more efficient. Ultimately 9m rows isn't that huge by modern standards, but you do have to think things through more than with 9k.
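A hedged sketch of the shred-in-SQL route, using the nvarchar(max) column from EDIT 2 and the same /xml/firstname-style paths the VBScript answer assumes; the ParsedLeads destination table and the element names are invented for illustration:

-- Cast the nvarchar(max) column to XML and pull fields out with .value()
-- (on 9 million rows you would want to run this in batches rather than one statement)
INSERT INTO ParsedLeads (LeadID, FirstName, LastName, LeadStatusID)
SELECT l.LeadID,
       x.LeadXml.value('(/xml/firstname)[1]', 'nvarchar(100)'),
       x.LeadXml.value('(/xml/lastname)[1]',  'nvarchar(100)'),
       l.LeadStatusID
FROM Leads AS l
CROSS APPLY (SELECT CAST(l.RawLeadXML AS XML) AS LeadXml) AS x;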
Upvotes: 0
Reputation: 1712
Do you have disk space to spare? Perhaps create the same table structure in a "TableAux", but with an autonumber Id, and then do an insert from the original table...
Importing into your TableAux from a flat file (exporting the data first if it isn't already a flat file) is another way to do it.
Getting an Id onto your records is the priority so that you can work with them.
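A minimal sketch of the TableAux idea in T-SQL; the table name, column sizes, and the single-statement copy are assumptions (in practice the copy could also be done in batches):

-- Same shape as Leads, plus an IDENTITY column so every row gets a sequential Id
CREATE TABLE TableAux (
    Id           INT IDENTITY(1,1) PRIMARY KEY,
    LeadID       UNIQUEIDENTIFIER NOT NULL,
    RawLeadXML   NVARCHAR(MAX)    NULL,
    LeadStatusID INT              NULL
);

-- Copy everything across; identity values are assigned on the way in
INSERT INTO TableAux (LeadID, RawLeadXML, LeadStatusID)
SELECT LeadID, RawLeadXML, LeadStatusID
FROM Leads;

After that, batching becomes a simple range filter, e.g. WHERE Id BETWEEN 1 AND 1000.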
Upvotes: 1
Reputation: 613
If this is a flat file, could you perhaps get the first 1000 rows, load them, then go back and remove those rows from the file based on the GUID, and repeat? (Make a backup of the file first, of course.)
Also, you could try loading the file via an Integration Services package, which should be able to track where it is in the import, batch the rows, and stay within your memory constraints.
Upvotes: 0