Reputation: 21
I have a very large file in the Azure Data Lake Store (257 GB), and when I tried to run a simple extract on it yesterday I got the following error:
Vertex terminated as it ran for more than 5h hours. The input size of the vertex SV1_Extract_Partition[0][53].v0 with guid {2F8802B8-F93A-47EE-80E2-274590BD76A5} is 1.171594 GB. In most situations, this is caused by data skew such as one data partition containing most of the data. Use of different partitioning scheme or re-partitioning data can resolve such issue.
So I'm pretty sure what is happening is that U-SQL is not properly partitioning my file. I'm using a custom-written extractor, but I don't see why that should be an issue.
How do I ensure that my files are properly partitioned? This mistake has cost me a lot of money (more than $2,000), so I really don't want to run anything at this scale again before I can make sure the files are partitioned correctly while the job is running.
Do I really have to manually split my file into smaller files?
Upvotes: 2
Views: 492
Reputation: 6684
The partition size of about 1 GB seems normal. The problem is more likely in your custom extractor, which is taking over 5 hours to process that data.
I would suggest investigating what your extractor does on that particular partition of the file.
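One thing worth checking: whether the extractor is marked as atomic. In U-SQL, the `SqlUserDefinedExtractor` attribute's `AtomicFileProcessing` flag controls whether the file can be split into extents and extracted in parallel across vertices, or must be fed through a single vertex. Below is a minimal sketch of a custom extractor skeleton; the class name, column layout, and tab delimiter are illustrative assumptions, not taken from the original post.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = false lets U-SQL split the input into extents
// and run the extractor on many vertices in parallel.
// AtomicFileProcessing = true forces the whole file through one vertex,
// which on a 257 GB input would serialize the entire extract.
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class MyExtractor : IExtractor   // hypothetical name
{
    public override IEnumerable<IRow> Extract(
        IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Keep per-row work cheap: allocations, regex
                // compilation, or external calls here are multiplied
                // across every row in the partition.
                var parts = line.Split('\t');
                output.Set(0, parts[0]);
                output.Set(1, parts[1]);
                yield return output.AsReadOnly();
            }
        }
    }
}
```

If the flag is `false` (as above) and the vertex is still slow on a ~1 GB partition, profiling the per-row logic locally on a sample extent of that partition is the next step.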
Upvotes: 2