NatalieEV

Reputation: 23

Replacing carriage returns and line feeds in a text file using Azure Synapse

I am currently working on a project where I have to get data from an Excel file in SharePoint. The file has multiple tabs, and I am using Azure Synapse Analytics to call a SharePoint API to fetch each tab and save it into Azure Blob Storage as a 'File'. I could save it as text or CSV instead, but the CSV would be mangled by the issue explained below.

The issue I am having is that one of the Excel columns contains free-text fields with embedded carriage returns and line feeds, although this only becomes apparent after the file has been converted to CSV.

I need to get this file into a SQL table in our data warehouse; however, rows are delimited by \r\n, which is how Synapse determines where each table row ends.

In the file I have, every 'real' line break (the end of a row that I need) is preceded by a comma, so I plan to replace every ,\r\n with a placeholder such as !**NEWLINE**!, then remove all remaining \r\n by replacing them with an empty string, and finally replace !**NEWLINE**! with ,\r\n again, as sketched below.
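In plain Python terms, what I do by hand amounts to something like this (file names are just examples, and this isn't wired into Synapse yet):

```python
# Three-pass replace: protect real row endings, strip stray breaks, restore.
# newline="" stops Python translating \r\n, so the replaces see the raw text.
with open("input.csv", encoding="utf-8", newline="") as f:
    text = f.read()

text = text.replace(",\r\n", "!**NEWLINE**!")   # protect real row endings
text = text.replace("\r\n", "")                 # drop embedded line breaks
text = text.replace("!**NEWLINE**!", ",\r\n")   # restore real row endings

with open("output.csv", "w", encoding="utf-8", newline="") as f:
    f.write(text)
```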

This does work when doing it in Notepad++, but I need it automated, for multiple files, weekly. I am trying to work out the best way to do this. I am considering a Notebook with PySpark, although I am having issues with formatting, and it doesn't seem the most appropriate tool since everything revolves around dataframes.

I am wondering if anyone has suggestions that can help me out: either a way to get rid of the carriage returns hiding in the Excel file (without doing it manually), or advice on how I can use a notebook, or any other Azure Synapse component, to get this done. I am happy to be pointed in the right direction in terms of ideas I can research if no definitive answer is possible.

I have tried to achieve this with a PySpark notebook, but the examples/tutorials I have followed all use dataframes and specify which column to do a find-and-replace in. I am unable to do that because the stray line breaks/carriage returns leave the file in an awkward format.

Upvotes: 1

Views: 145

Answers (1)

Sam Malcolm

Reputation: 13

What sort of data sizes are you looking at? You could use the Spark node just like a normal Python machine: load the file into memory as a string, call the replace method on the string (using your logic above), and then either convert it to a dataframe for any further transformations you require or just write it back to ADLS. This obviously isn't ideal, as it doesn't leverage the parallelism of Spark dataframes, but it could work as a workaround if your data sizes allow for it.
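Something along these lines, as a rough sketch; the paths and the header assumption are illustrative, not tested against your files:

```python
# Pull the whole file onto the driver as one string, clean it, re-parse as CSV.
# wholeTextFiles yields (filename, full_content) pairs - fine for small files,
# since the entire file ends up in driver memory.
path = "abfss://container@account.dfs.core.windows.net/raw/export.csv"
raw = spark.sparkContext.wholeTextFiles(path).first()[1]

clean = (raw.replace(",\r\n", "!**NEWLINE**!")   # protect real row endings
            .replace("\r\n", "")                 # drop embedded line breaks
            .replace("!**NEWLINE**!", ",\n"))    # restore row endings

# spark.read.csv accepts an RDD of strings, so the cleaned text can be
# re-parsed without writing an intermediate file.
lines = spark.sparkContext.parallelize(clean.split("\n"))
df = spark.read.csv(lines, header=True)  # assuming the first row is a header

df.write.mode("overwrite").csv(
    "abfss://container@account.dfs.core.windows.net/clean/export"
)
```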

Upvotes: 1
