Reputation: 21
I have a very large file in the Azure Data Lake Store (257 GB), and when I tried to run a simple extract on it yesterday I got the following error:
Vertex terminated as it ran for more than 5h hours. The input size of the vertex SV1_Extract_Partition[0][53].v0 with guid {2F8802B8-F93A-47EE-80E2-274590BD76A5} is 1.171594 GB. In most situations, this is caused by data skew such as one data partition containing most of the data. Use of different partitioning scheme or re-partitioning data can resolve such issue.
So I'm pretty sure what is happening is that U-SQL is not properly partitioning my file. I'm using a custom-written extractor, but I don't see why that should be an issue.
How do I ensure that my files are properly partitioned? This mistake has cost me a lot of money (more than $2,000), so I really don't want to run anything at this scale again before I can make sure the files are partitioned correctly while the job is running.
Do I really have to manually split my file into smaller files?
Upvotes: 2
Views: 492
Reputation: 6684
The partition size of about 1 GB seems normal. The problem is more likely in your custom extractor, which is taking over 5 hours to process that data.
I would suggest investigating what your extractor does on that particular partition of the file.
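One thing worth checking: whether the extractor is marked as atomic. In U-SQL, the `SqlUserDefinedExtractor` attribute's `AtomicFileProcessing` flag controls whether the file can be split into extents and extracted in parallel across vertices, or must be fed through a single vertex. Below is a minimal sketch of a custom extractor skeleton; the class name, column layout, and tab delimiter are illustrative assumptions, not taken from the original post.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = false lets U-SQL split the input into extents
// and run the extractor on many vertices in parallel.
// AtomicFileProcessing = true forces the whole file through one vertex,
// which on a 257 GB input would serialize the entire extract.
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class MyExtractor : IExtractor   // hypothetical name
{
    public override IEnumerable<IRow> Extract(
        IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Keep per-row work cheap: allocations, regex
                // compilation, or external calls here are multiplied
                // across every row in the partition.
                var parts = line.Split('\t');
                output.Set(0, parts[0]);
                output.Set(1, parts[1]);
                yield return output.AsReadOnly();
            }
        }
    }
}
```

If the flag is `false` (as above) and the vertex is still slow on a ~1 GB partition, profiling the per-row logic locally on a sample extent of that partition is the next step.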
Upvotes: 2