Reputation: 21381
I am using the cat command to merge multiple files via a %sh command in an Azure Databricks notebook. I have around 1200 CSV files in the data_files folder, and their total size is around 300 GB. When I run the code below, sometimes it merges the files without any errors, but sometimes it fails with cat: write error: Resource temporarily unavailable and the output.txt file is created with no data.
err=$(cat /dbfs/mnt/devl/header_file/*.csv /dbfs/mnt/devl/data_files/*.csv 2>&1 > /dbfs/mnt/devl/output.txt)
RC=$?
if [ $RC -ne 0 ]; then
  echo "Error code : $RC"
  echo "Error msg : $err"
fi
Can anyone tell me what the root cause of the error cat: write error: Resource temporarily unavailable is, and how to resolve it?
Upvotes: 0
Views: 405
Reputation: 1126
I don't have access to Azure, but if I had to guess - your issue has to do with the destination drive's write speed.
Explanation
Write speed is usually several times lower than read speed, so with large enough source files the application can read data faster than the destination can absorb it, filling the kernel buffer. Picture a water hose filling a bucket with a small hole - eventually it fills up. Once that happens, there are 2 different behaviors, depending on how the destination descriptor was opened:

- In blocking mode, the write call will wait for the destination to flush enough data, and then resume writing.
- In non-blocking mode, the write call will fail (return -1 and set errno to EAGAIN / EWOULDBLOCK). This gives the client application a lot of flexibility, especially in multithreaded use cases - but requires the condition to be handled explicitly.

You're using cat, which doesn't seem to handle non-blocking mode - it simply errors out on EAGAIN.
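To make "handled explicitly" concrete, here's a minimal sketch of what a client could do in non-blocking mode - retry the write after waiting for the descriptor to become writable. The function name and structure are mine for illustration, not anything cat does:

#!/usr/bin/env python3
import os
import select

def write_all(fd: int, data: bytes) -> None:
    # Keep writing until every byte is accepted, even if fd is non-blocking.
    view = memoryview(data)
    while view:
        try:
            written = os.write(fd, view)
            view = view[written:]
        except BlockingIOError:
            # EAGAIN / EWOULDBLOCK: the kernel buffer is full.
            # Wait until the descriptor is writable again, then retry.
            select.select([], [fd], [])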
Workarounds
I see a suggested workaround in a comment under https://unix.stackexchange.com/questions/613117/cat-resource-temporarily-unavailable:

perl -MFcntl -e 'fcntl STDIN, F_SETFL, fcntl(STDIN, F_GETFL, 0) & ~O_NONBLOCK'

This clears the O_NONBLOCK flag on the inherited descriptor (STDIN here), so subsequent I/O on it blocks instead of failing with EAGAIN.
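The same flag flip can be done without Perl, since it's just a pair of fcntl(2) calls - for example from Python. A sketch, assuming (as in the one-liner above) that the non-blocking descriptor is stdin, i.e. fd 0:

#!/usr/bin/env python3
import fcntl
import os

# Read the current status flags of fd 0, then clear O_NONBLOCK.
flags = fcntl.fcntl(0, fcntl.F_GETFL)
fcntl.fcntl(0, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)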
Alternatively, you could look for a tool that does it better - or write your own utility. Here's an example of a tiny Python script that does what you're looking for. Note that I only tested it on tiny files, so YMMV.
#!/usr/bin/env python3
import shutil
import os

def concatenate_files(src_filenames: list[str], dst_filename: str) -> None:
    with open(dst_filename, 'wb') as outfile:
        for infile_name in src_filenames:
            with open(infile_name, 'rb') as infile:
                # Stream each source file into the destination in chunks
                shutil.copyfileobj(infile, outfile)
        # Flush Python's buffer and force the data to disk before closing
        outfile.flush()
        os.fsync(outfile.fileno())

if __name__ == "__main__":
    # Take command line arguments to make it actually useful
    input_files = ['file1.dat', 'file2.dat', 'file3.dat']
    output_file = 'result.dat'
    concatenate_files(input_files, output_file)
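To point it at the paths from your question, you could expand the globs inside the script instead of hardcoding filenames - a sketch of a replacement __main__ block (header files first, matching your cat invocation; untested on Databricks):

if __name__ == "__main__":
    import glob
    # Sort so the concatenation order is deterministic across runs
    input_files = (sorted(glob.glob('/dbfs/mnt/devl/header_file/*.csv'))
                   + sorted(glob.glob('/dbfs/mnt/devl/data_files/*.csv')))
    concatenate_files(input_files, '/dbfs/mnt/devl/output.txt')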
Upvotes: 0