Reputation: 137
I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers.
Simplifying a bit, I have a large set of data (with daily increments) but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility to do a cross-client analysis (much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client data together makes most sense). I was looking into two scenarios:
I am now looking into pros and cons of both scenarios. Some things that come to my mind are:
One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://learn.microsoft.com/en-us/u-sql/ddl/tables):
First we can read that:
"U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?
Later we can read that:
"..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."
Now this makes things more complicated. If I read it correctly, this means that if I will never rebuild a table I will have my data stored physically in exactly the same way as the original raw files and thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions as the guids were different (the .ss files in the store had different guids than partitions in usql views) so this is just my guess.
Therefore I have several questions:
Many thanks for your help,
Jakub
EDIT: I did some more tests and it only made it more intriguing.
Conclusion? If I am not mistaken, this looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Is there anyone who could please explain I am wrong?
Upvotes: 4
Views: 226
Reputation: 14389
Microsoft have recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing v round-robin and partitioning.
Upvotes: 1