user11035198

Run sed for a subset of arguments at a time

I am currently running sed from Python via subprocess, but I am receiving the error:

"OSError: [Errno 7] Argument list too long: 'sed'"

The Python code is:

subprocess.run(['sed', '-i',
                '-e', 's/#/pau/g',
                *glob.glob('label_POS/label_phone_align/dump/*')], check=True)

The /dump/ directory has ~13,000 files in it. I have been told that I need to run the command on subsets of the argument list, but I can't find out how to do that.

Upvotes: 1

Views: 245

Answers (2)

tripleee

Reputation: 189467

Please scroll down to the end of this answer for the solution I recommend for your specific problem. There's a bit of background here for context and/or future visitors grappling with other "argument list too long" errors.

The exec() system call has a size limit: the arguments passed to a process, together with its environment, cannot exceed ARG_MAX bytes. The value of this system constant can usually be queried with the getconf ARG_MAX command on modern systems.

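import glob
import subprocess

# Ask the system for the exec() size limit, in bytes.
arg_max = subprocess.run(['getconf', 'ARG_MAX'],
    text=True, check=True, capture_output=True
    ).stdout.strip()
# The environment counts toward the limit too, so leave generous headroom.
arg_max = int(arg_max) // 2

cmd = ['sed', '-i', '-e', 's/#/pau/g']
files = glob.glob('label_POS/label_phone_align/dump/*')
while files:
    # Each argument costs its length plus one terminating null byte.
    size = sum(len(x) for x in cmd) + len(cmd)
    count = 0
    for name in files:
        size += 1 + len(name)
        # Always take at least one file so the loop cannot stall.
        if size > arg_max and count > 0:
            break
        count += 1
    subprocess.run(cmd + files[:count], check=True)
    files = files[count:]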
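If you need this in more than one place, the batching can be pulled out into a generator. This is only a sketch: the names batches and run_batched are made up for illustration, the lengths are measured on the encoded bytes rather than on the str, and using half of os.sysconf('SC_ARG_MAX') as the budget is an assumption meant to leave room for the environment, not a precise calculation.

```python
import os
import subprocess


def batches(cmd, files, limit):
    """Yield lists of files so that cmd plus each list stays within limit bytes."""
    # Each argument costs its encoded length plus one terminating null byte.
    base = sum(len(os.fsencode(arg)) + 1 for arg in cmd)
    batch, size = [], base
    for name in files:
        cost = len(os.fsencode(name)) + 1
        # Start a new batch when this file would not fit; a batch is never
        # left empty, so a single oversized name is still yielded on its own.
        if batch and size + cost > limit:
            yield batch
            batch, size = [], base
        batch.append(name)
        size += cost
    if batch:
        yield batch


def run_batched(cmd, files):
    """Run cmd once per batch (hypothetical helper, not called here)."""
    # Assumption: half of ARG_MAX leaves enough headroom for the environment.
    limit = os.sysconf('SC_ARG_MAX') // 2
    for batch in batches(cmd, files, limit):
        subprocess.run([*cmd, *batch], check=True)
```

With the question's command this would be called as run_batched(['sed', '-i', '-e', 's/#/pau/g'], files).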

Of course, the xargs command already does exactly this for you.

import subprocess
import glob

subprocess.run(
    ['xargs', '-r', '-0', 'sed', '-i', '-e', 's/#/pau/g'],
    input=b'\0'.join([x.encode() for x in glob.glob('label_POS/label_phone_align/dump/*') + ['']]),
    check=True)

Simply removing the long path might be enough in your case, though. You are repeating label_POS/label_phone_align/dump/ in front of every file name in the argument array, which adds 33 bytes per file.

import glob
import subprocess
import os

path = 'label_POS/label_phone_align/dump'
files = [os.path.basename(file)
    for file in glob.glob(os.path.join(path, '*'))]
subprocess.run(
    ['sed', '-i', '-e', 's/#/pau/g', *files],
    cwd=path, check=True)

Ultimately, you may prefer a pure Python solution, which has no argument list at all.

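import glob
import fileinput

# With inplace=True, print() writes into the file currently being read.
for line in fileinput.input(glob.glob('label_POS/label_phone_align/dump/*'), inplace=True):
    # line already ends with a newline, so don't let print() add another.
    print(line.replace('#', 'pau'), end='')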
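If you would rather not depend on fileinput, a rough equivalent with pathlib could look like the following. Treat it as a sketch under a few assumptions: replace_in_files is a made-up name, each file is small enough to read into memory, and working on raw bytes is acceptable (which sidesteps guessing the encoding and keeps line endings untouched). Hidden files are skipped to mirror what the * glob matches.

```python
from pathlib import Path


def replace_in_files(directory, old=b'#', new=b'pau'):
    """Replace old with new in every regular file directly inside directory.

    Returns the number of files that were actually rewritten.
    """
    changed = 0
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and dotfiles, like the shell-style * pattern.
        if not path.is_file() or path.name.startswith('.'):
            continue
        data = path.read_bytes()
        updated = data.replace(old, new)
        # Only write when something changed, to leave other files untouched.
        if updated != data:
            path.write_bytes(updated)
            changed += 1
    return changed
```

For the directory in the question, that would be replace_in_files('label_POS/label_phone_align/dump').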

Upvotes: 0

Green Cloak Guy

Reputation: 24691

Whoever told you that probably meant that you need to split up the glob and run multiple separate commands:

import glob
import subprocess

files = glob.glob('label_POS/label_phone_align/dump/*')
i = 0
scale = 100
# process in units of 100 filenames until we have them all
while scale*i < len(files):
    subprocess.run(['sed', '-i',
            '-e', 's/#/pau/g',
            *files[scale*i:scale*(i+1)]], check=True)
    i += 1

and then amalgamate any output however you need, after the fact. The limit applies to the total size of the command line rather than to the number of files, so how many files sed can accept depends on how long the paths are; for yours it's apparently fewer than 13,000. You can keep changing scale until it doesn't error.
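Instead of tuning scale by hand, you could let the code shrink it whenever the error shows up. The sketch below assumes the failure surfaces as an OSError with errno.E2BIG (errno 7, the one in your traceback); run_in_chunks and its run parameter are names invented for this example, and run exists only so the function can be tried without actually calling sed.

```python
import errno
import subprocess


def run_in_chunks(cmd, files, size=1000, run=subprocess.run):
    """Run cmd over files in chunks, halving the chunk size on E2BIG.

    Returns the chunk size that was in use at the end.
    """
    start = 0
    while start < len(files):
        chunk = files[start:start + size]
        try:
            run([*cmd, *chunk], check=True)
        except OSError as exc:
            # Give up on any other error, or if even one file is too much.
            if exc.errno != errno.E2BIG or size == 1:
                raise
            size = max(1, size // 2)
            continue  # retry from the same position with a smaller chunk
        start += len(chunk)
    return size
```

Called as run_in_chunks(['sed', '-i', '-e', 's/#/pau/g'], files), it only moves past a chunk once that chunk has run successfully, so no file is processed twice.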

Upvotes: 1
