detcle

Reputation: 79

How to efficiently join two directories that are already partitioned

Suppose I have two different data sets A and B, and they are both already partitioned by joinKey and laid out in the filesystem as A/joinKey/<files> and B/joinKey/<files>.

For example, let's say joinKey is some hash value. Then, there may be directories like A/000/<files>, A/001/<files>, A/002/<files>, and so on. And there is also B/000/<files>, B/001/<files>, B/002/<files>, and so on.

I can only join with the same joinKey, so I can only join A/000/<files> with B/000/<files>; I can only join A/001/<files> with B/001/<files>; and so on.

What is the best way to write a Spark job that joins the data in A/000/<files> with the data in B/000/<files> to produce output C/000/<files>; then joins the data in A/001/<files> with the data in B/001/<files> to produce output C/001/<files>; and so on?

My idea right now is that we would iterate over all joinKey values in Spark (e.g. do a simple for loop over the string values 000, 001, 002, etc.), and then load A/000/<files> as a DataFrame and load B/000/<files> as a DataFrame and join them; then, load A/001/<files> as a DataFrame and load B/001/<files> as a DataFrame and join. Is this reasonable, or is there a better approach?
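The looping idea above can be sketched without Spark at all: discover which joinKey values exist under both roots, and pair up the corresponding directories to be joined. The roots, keys, and `pair_partitions` helper below are all hypothetical, purely for illustration (in a real job, the key lists would come from listing the filesystem, and each pair would be loaded and joined as two DataFrames):

```python
# Hypothetical helper: given the two root directories and the joinKey
# subdirectories found under each, compute the (A/<key>, B/<key>) pairs
# that can be joined. Only keys present under BOTH roots produce a pair.
def pair_partitions(a_root, b_root, a_keys, b_keys):
    common = sorted(set(a_keys) & set(b_keys))
    return [(f"{a_root}/{k}", f"{b_root}/{k}") for k in common]

# a_keys/b_keys stand in for a directory listing of A/ and B/.
pairs = pair_partitions("A", "B", ["000", "001", "002"], ["000", "001", "003"])
print(pairs)  # [('A/000', 'B/000'), ('A/001', 'B/001')]
```

Each resulting pair is then one independent unit of work: load both paths, join, and write the output for that key.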

For reference, suppose A/000/<files>, loaded as a DataFrame, has the schema <username, age>, while B/000/<files> has the schema <username, email>, and I want to join the two DataFrames on the username column. This join could happen within a single executor or node, since both tables (A/000/<files> and B/000/<files>) are small enough to fit in memory without spreading the data across multiple executors. So one executor would join A/000/<files> with B/000/<files>; another executor would join A/001/<files> with B/001/<files>.
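Within one partition, joining two small tables on username is just an in-memory hash join. A plain-Python sketch with made-up rows (in Spark this would be two small DataFrames joined on the "username" column):

```python
# One partition's worth of data (made-up rows standing in for A/000 and B/000).
a_rows = [("alice", 30), ("bob", 25)]         # schema: (username, age)
b_rows = [("alice", "alice@example.com")]     # schema: (username, email)

# Build a hash map on the join column from the smaller side,
# then do an inner join: keep only usernames present in both tables.
email_by_user = dict(b_rows)
joined = [(u, age, email_by_user[u])
          for (u, age) in a_rows if u in email_by_user]

print(joined)  # [('alice', 30, 'alice@example.com')]
```

Because all the rows for one joinKey sit on one executor, this join needs no shuffle; only the username lookup happens, entirely in memory.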

The idea here is that the datasets A and B are already partitioned by joinKey (000, 001, 002, etc.) in the filesystem. Only 000 data from A can join with 000 data from B, so it would be great to take advantage of this by having one executor handle the 000 data from both A and B. That executor can then do the join without any shuffling or repartitioning, keeping both DataFrames entirely in memory and joining on a second key, the username in this example.

Upvotes: 1

Views: 64

Answers (1)

Matt Andruff

Reputation: 5135

Are they really partitioned by the same key? If so, then yes, use two DataFrames; it should be a fast join.

Your question is a little unclear, so...

If you are merely saying that they are in a similar file structure, then yes, use two DataFrames. It won't be as fast, but this is still how you join two data sets.

You are correct in thinking that if they're partitioned the same, the join will be faster than with a non-matching partition key.

Once you join them, write the result to file. Only partition the joined output if you intend to access the data by the partition key; partitioning slows down general queries that don't involve the partition key.

Upvotes: 1
