Reputation: 21381
I am using the cat command to merge multiple files via a %sh command in an Azure Databricks notebook. I have around 1200 CSV files in the data_files folder, and their total size is around 300 GB. When I run the code below, sometimes it merges the files without any errors, but sometimes it fails with cat: write error: Resource temporarily unavailable and the output.txt file is created with no data.
err=$(cat /dbfs/mnt/devl/header_file/*.csv /dbfs/mnt/devl/data_files/*.csv 2>&1 > /dbfs/mnt/devl/output.txt)
RC=$?
if [ $RC -ne 0 ]; then
  echo "Error code : $RC"
  echo "Error msg : $err"
fi
Can anyone tell me what the root cause of the error cat: write error: Resource temporarily unavailable is, and how to resolve it?
Upvotes: 0
Views: 405
Reputation: 1126
I don't have access to Azure, but if I had to guess - your issue has to do with the destination drive's write speed.
Explanation
Write speed is usually several times lower than read speed, so with large enough source files the application can read data faster than the destination can absorb it, filling the kernel buffer. Picture a water hose filling a bucket with a small hole - eventually it fills up. Once that happens, there are 2 different behaviors, depending on how the destination descriptor was opened:

- In blocking mode, the write call will wait for the destination to flush enough data, and then resume writing.
- In non-blocking mode, the write call will fail (return -1 and set errno to EAGAIN / EWOULDBLOCK). This gives the client application a lot of flexibility, especially in multithreaded use cases - but requires the condition to be handled explicitly.

You're using cat, which doesn't seem to handle non-blocking mode - it simply errors out on EAGAIN.
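To make "handled explicitly" concrete, here's a minimal sketch of what a client could do in non-blocking mode - retry the write after waiting for the descriptor to become writable. The function name and structure are mine for illustration, not anything cat does:

#!/usr/bin/env python3
import os
import select

def write_all(fd: int, data: bytes) -> None:
    # Keep writing until every byte is accepted, even if fd is non-blocking.
    view = memoryview(data)
    while view:
        try:
            written = os.write(fd, view)
            view = view[written:]
        except BlockingIOError:
            # EAGAIN / EWOULDBLOCK: the kernel buffer is full.
            # Wait until the descriptor is writable again, then retry.
            select.select([], [fd], [])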
Workarounds
I see a suggested workaround in a comment under https://unix.stackexchange.com/questions/613117/cat-resource-temporarily-unavailable:

perl -MFcntl -e 'fcntl STDIN, F_SETFL, fcntl(STDIN, F_GETFL, 0) & ~O_NONBLOCK'

This clears the O_NONBLOCK flag on the inherited descriptor (STDIN here), so subsequent I/O on it blocks instead of failing with EAGAIN.
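The same flag flip can be done without Perl, since it's just a pair of fcntl(2) calls - for example from Python. A sketch, assuming (as in the one-liner above) that the non-blocking descriptor is stdin, i.e. fd 0:

#!/usr/bin/env python3
import fcntl
import os

# Read the current status flags of fd 0, then clear O_NONBLOCK.
flags = fcntl.fcntl(0, fcntl.F_GETFL)
fcntl.fcntl(0, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)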
Alternatively, you could look for a tool that does it better - or write your own utility. Here's an example of a tiny Python script that does what you're looking for. Note that I only tested it on tiny files, so YMMV.
#!/usr/bin/env python3
import shutil
import os

def concatenate_files(src_filenames: list[str], dst_filename: str) -> None:
    with open(dst_filename, 'wb') as outfile:
        for infile_name in src_filenames:
            with open(infile_name, 'rb') as infile:
                # Stream each source file into the destination in chunks
                shutil.copyfileobj(infile, outfile)
        # Flush Python's buffer and force the data to disk before closing
        outfile.flush()
        os.fsync(outfile.fileno())

if __name__ == "__main__":
    # Take command line arguments to make it actually useful
    input_files = ['file1.dat', 'file2.dat', 'file3.dat']
    output_file = 'result.dat'
    concatenate_files(input_files, output_file)
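To point it at the paths from your question, you could expand the globs inside the script instead of hardcoding filenames - a sketch of a replacement __main__ block (header files first, matching your cat invocation; untested on Databricks):

if __name__ == "__main__":
    import glob
    # Sort so the concatenation order is deterministic across runs
    input_files = (sorted(glob.glob('/dbfs/mnt/devl/header_file/*.csv'))
                   + sorted(glob.glob('/dbfs/mnt/devl/data_files/*.csv')))
    concatenate_files(input_files, '/dbfs/mnt/devl/output.txt')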
Upvotes: 0