Anzah Hayat

Reputation: 21

How to add columns from multiple files in U-SQL in ADLA?

I have a lot of CSV files in an Azure Data Lake, containing data of various types (e.g., pressure, temperature, true/false). They are all time-stamped, and I need to collect them into a single file, aligned by timestamp, for machine learning purposes. This is easy enough to do in Java: open a file stream, loop over the folder opening each file, compare timestamps to write the relevant values to the output file, and start a new column (going to the end of the first line) for each file. While I've worked around the timestamp problem in U-SQL, I'm having trouble coming up with syntax that will let me run this over the whole folder. The wildcard syntax {*} treats all files as a single fileset, whereas I need some sort of loop to join a column from each file individually. Is there any way to do this, perhaps using virtual columns?
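For reference, here is the kind of thing I am trying to express, sketched with what I understand of file sets and virtual columns. The folder layout (`/input/{SensorFile}.csv`), the two-column schema, and the sensor names `pressure`/`temperature` are made up, and I am not sure the conditional-aggregation pivot is the right approach:

    // Sketch only: assumes each sensor writes /input/<name>.csv with a header row
    // and two columns (timestamp, value). The file name is surfaced through the
    // virtual column SensorFile declared in the file-set pattern below.
    @data =
        EXTRACT Timestamp  DateTime,
                Value      double,
                SensorFile string   // virtual column bound by the {SensorFile} pattern
        FROM "/input/{SensorFile}.csv"
        USING Extractors.Csv(skipFirstNRows : 1);

    // Pivot one column per (assumed) sensor name via conditional aggregation,
    // keyed on the shared timestamp; non-matching rows contribute null.
    @wide =
        SELECT Timestamp,
               MAX(SensorFile == "pressure"    ? (double?)Value : null) AS Pressure,
               MAX(SensorFile == "temperature" ? (double?)Value : null) AS Temperature
        FROM @data
        GROUP BY Timestamp;

    OUTPUT @wide
    TO "/output/combined.csv"
    ORDER BY Timestamp
    USING Outputters.Csv(outputHeader : true);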

Upvotes: 2

Views: 298

Answers (1)

Michael Rys

Reputation: 6684

First, you have to think about your problem functionally/declaratively rather than in terms of procedural paradigms such as loops.

Let me try to rephrase your question to see if I can help. You have many CSV files with timestamped data. Different files can have rows with the same timestamp, and you want all rows for the same timestamp (or range of timestamps) output to a specific file? So you basically want to repartition the data?

What is the format of each of the files? Do they all have the same schema or different schemas? In the latter case, how can you differentiate them? Based on filename?

Let me know in the comments whether that is a correct declarative restatement, along with the answers to my questions, and I will augment my answer with the next step.
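To illustrate the last question: if the files can be differentiated by name, the file-set virtual column can also be filtered on. A rough sketch, where the folder layout and the `pressure_` prefix are just placeholders:

    // Sketch only: the folder layout and the "pressure_" prefix are assumptions.
    // Filtering on the virtual column SensorFile selects just the matching files
    // (U-SQL can often push such predicates down so non-matching files are skipped).
    @allFiles =
        EXTRACT Timestamp  DateTime,
                Value      double,
                SensorFile string
        FROM "/input/{SensorFile}.csv"
        USING Extractors.Csv(skipFirstNRows : 1);

    @pressureOnly =
        SELECT Timestamp, Value
        FROM @allFiles
        WHERE SensorFile.StartsWith("pressure_");

    OUTPUT @pressureOnly
    TO "/output/pressure.csv"
    ORDER BY Timestamp
    USING Outputters.Csv(outputHeader : true);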

Upvotes: 1
