Reputation: 21
I need to join a huge sales table (more than 600 million rows) with additional information from other tables. The first part of the flow looks like this:
It starts sorting the data, but after 190M rows it fails with an error about a lack of disk space:
Information: 0x4004300C at load NT_sales by customers, SSIS.Pipeline: Execute phase is beginning.
Error: 0x80070070 at load NT_sales by customers:
Error: 0xC004704A at load NT_sales by customers: The buffer manager cannot extend the file "D:\Users\BUB2523\AppData\Local\Temp\17\DTS{1CF72EC9-E25F-49B1-AE1E-52296F16E0F2}.tmp" to length 4640004 bytes. There was insufficient disk space.
Error: 0xC0047048 at load NT_sales by customers:
Information: 0x4004800D at load NT_sales by customers: The buffer manager failed a memory allocation call for 4640000 bytes, but was unable to swap out any buffers to relieve memory pressure. 1 buffer was considered and 0 were locked. Either not enough memory is available to the pipeline because not enough are installed, other processes were using it, or too many buffers are locked.
Information: 0x4004800F at load NT_sales by customers: Buffer manager allocated 2801 megabyte(s) in 603 physical buffer(s).
Information: 0x40048010 at load NT_sales by customers: Component "Sort" (1325) owns 2585 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Merge Join 4" (572) owns 27 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Merge Join 2" (415) owns 23 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "SUM and GROUP BY" (1930) owns 23 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Sort by cat_2" (1375) owns 21 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Merge Join 3" (495) owns 21 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Sort by cat_3" (1505) owns 21 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Sort by EAN" (1640) owns 13 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Merge Join 1" (349) owns 13 megabyte(s) physical buffer.
Information: 0x40048010 at load NT_sales by customers: Component "Sort by FK_store_code" (1770) owns 12 megabyte(s) physical buffer.
I tried to avoid the sorting by ordering the data at the source, but then it fails in a similar way on a Merge Join. I've decreased the DefaultBufferMaxRows property, but it doesn't help. The process runs slower, but it fails at the same point and consumes all of the available disk space (approximately 80 GB).
Can I split it into batches somehow, or is there maybe a mechanism in SSIS to execute this project multiple times over smaller slices of the data?
P.S. Thanks for your answers. I've partitioned the data by weeks and pushed the sorting into the databases with ORDER BY queries, but there is still a problem: the sorting never ends. 25M rows is not that many in my opinion, but for some reason the sort doesn't finish even after a few hours, and I have to do four of them in this flow. Maybe there are some properties to change or something? I have no idea what's going on.
Upvotes: 1
Views: 1306
Reputation: 61269
If I hand you a bag of numbers and ask you to sort them, you can't finish the task until you've taken every number out of the bag, because the last one might affect the ordering of the rest. The same holds true here: the Sort operation cannot complete until all 600M rows have arrived. In SSIS parlance, a Sort is an asynchronous, fully blocking component. Asynchronous components effectively halve the amount of memory available to your process because the data is physically copied from one memory address to another; as a general rule, SSIS otherwise operates on pointers to an address, which is how SSIS set the ETL record back in the day. You have five Sorts visible in your screenshot, so you either need an amazingly large amount of memory available or you need to restructure your workflow.
The easy and correct answer is to write a query that does the joins. You (or your DBAs) can tune a query all day long to make it faster and less resource-intensive. The database engine can use indexes, statistics, etc. for efficient data access; SSIS has none of that available to it. It's going to pull every bit of data into the pipeline, and only then will it discover "oh, I didn't need that" — on top of the network, memory, and CPU costs along the way. As always, do as much as possible in your source systems. For trivial workloads on commodity hardware your pattern, while slow and inefficient, is fine; half a billion rows puts you into a different category where these things start to matter.
You might find that you still need to apply the next set of optimizations to your data flow, but rewriting the data extraction into a proper query is where the work starts.
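As a rough sketch of what that consolidation looks like (table and column names here are invented, since I can only guess at your schema), the whole chain of merge joins and sorts collapses into something like:

```sql
-- Hypothetical tables and keys; substitute your own.
-- The ORDER BY lets you mark the source output as pre-sorted
-- (IsSorted = True on the output, SortKeyPosition on the key
-- columns in the source's Advanced Editor), so no Sort
-- component is needed downstream.
SELECT
    s.sale_id
,   s.FK_store_code
,   s.EAN
,   s.amount
,   c.customer_name
,   st.store_name
FROM
    dbo.NT_sales AS s
    INNER JOIN dbo.customers AS c
        ON c.customer_id = s.customer_id
    INNER JOIN dbo.stores AS st
        ON st.store_code = s.FK_store_code
ORDER BY
    s.FK_store_code;
```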
If the data supports it, segment the number of rows that flow into the system. If the destination only needs 2021 data, filter the source by adding a WHERE clause: WHERE SRC.insert_date >= '2021-01-01' AND SRC.insert_date < '2022-01-01';
If you need all of the data, this still applies, but you'll need to add a ForEach enumerator outside the Data Flow and cycle through the years, as in the sketch below.
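Inside the loop, the OLE DB Source query takes the loop's current value via the ? placeholders, which you map to your SSIS variables on the source's Parameters page. Again a sketch with invented names:

```sql
-- The ? markers are mapped to SSIS variables, e.g. hypothetical
-- @[User::RangeStart] and @[User::RangeEnd], which the ForEach
-- enumerator advances on every pass.
SELECT
    s.sale_id
,   s.customer_id
,   s.amount
FROM
    dbo.NT_sales AS s
WHERE
    s.insert_date >= ?
    AND s.insert_date < ?;
```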
A Merge Join is needed only if one source row can match multiple reference rows. If it's always a one-to-zero or one-to-one match, a Lookup is going to be far more effective. With a Lookup you can also apply the horizontal segmentation from above to limit the amount of reference data it pulls in. The Lookup eliminates the costly sort operations and produces the same result, as long as you're not dealing with a one-to-many relationship.
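A quick way to find out whether a reference table is safe for a Lookup is to test it for duplicate keys (hypothetical names once more):

```sql
-- Zero rows returned means each key appears at most once in the
-- reference table, so a Lookup is safe. Any rows returned mean a
-- one-to-many relationship, and the Merge Join is still needed.
SELECT
    c.customer_id
,   COUNT(*) AS matches
FROM
    dbo.customers AS c
GROUP BY
    c.customer_id
HAVING
    COUNT(*) > 1;
```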
Upvotes: 1