user1745713

Reputation: 791

How to split a file into smaller files (one file per line) [split doesn't work]

I'm trying to split a very large file into one new file per line.

Why? It's going to be input for Mahout, but there are too many lines and not enough suffixes for split.

Is there a way to do this in bash?

Upvotes: 3

Views: 1959

Answers (4)

Ole Tange

Reputation: 33685

GNU Parallel can do this:

cat big.file | parallel --pipe -N1 'cat > {#}'

But if Mahout can read from stdin then you can avoid the temporary files:

cat big.file | parallel --pipe -N1 mahout --input-file -

Learn more about GNU Parallel at https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial at http://www.gnu.org/software/parallel/parallel_tutorial.html

Upvotes: 1

glenn jackman

Reputation: 246774

Here's another way to do something for each line:

while IFS= read -r line; do
    do_something_with "$line"
done < big.file
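
As a rough sketch of how that loop could produce one file per line: the directory name demo_lines, the line_ prefix and the sample input are all made up for the demo, and the zero-padded counter is just one possible naming scheme.

```shell
# Sketch: write each line of a small sample input to its own numbered file.
# "demo_lines" and the "line_" prefix are hypothetical names.
outdir=demo_lines
mkdir -p "$outdir"
printf '%s\n' 'first line' 'second  line' 'third\line' > "$outdir/big.file"

n=0
while IFS= read -r line; do
    n=$((n + 1))
    # %07d zero-pads the counter so the files sort in line order.
    printf '%s\n' "$line" > "$outdir/$(printf 'line_%07d' "$n")"
done < "$outdir/big.file"

echo "wrote $n files"
```

IFS= and -r keep leading whitespace and backslashes intact, so each output file holds exactly the original line. Expect this to be slow on a very large input, since the shell handles one line at a time.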

Upvotes: 1

Todd A. Jacobs

Reputation: 84343

Increase Your Suffix Length with Split

If you insist on using split, then you have to increase your suffix length. For example, assuming you have 10,000 lines in your file:

split --suffix-length=5 --lines=1 foo.txt
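
To see what that produces, here is a minimal sketch on a four-line sample file. It uses the short options -a and -l (the short forms of --suffix-length and --lines); the split_demo directory and the part. prefix are made-up names for the demo.

```shell
# Sketch: split a small sample file into one file per line.
# "split_demo" and the "part." prefix are hypothetical names.
mkdir -p split_demo
printf '%s\n' alpha beta gamma delta > split_demo/foo.txt

# -l 1 puts one line in each output file; -a 3 asks for three-letter
# suffixes (aaa, aab, ...), which leaves room for 26^3 = 17576 files.
split -a 3 -l 1 split_demo/foo.txt split_demo/part.

# Lists foo.txt plus part.aaa through part.aad.
ls split_demo
```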

If you really want to go nuts with this approach, you can even set the suffix length dynamically with the wc command and some shell arithmetic. For example:

file='foo.txt'
split \
    --suffix-length=$(( $(wc --chars < <(wc --lines < "$file")) - 1 )) \
    --lines=1 \
    "$file"

Use Xargs Instead

However, the above is really just a kludge anyway. A more correct solution would be to use xargs from the GNU findutils package to invoke some command once per line. For example:

xargs --max-lines=1 --arg-file=foo.txt your_command

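This invokes your command once per input line, passing the words on that line as its arguments. It is a much more flexible approach, and because it avoids writing one temporary file per line it will dramatically reduce your disk I/O.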
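
One thing to watch for: with --max-lines, xargs still splits each line into words. The sketch below compares that with GNU xargs' --delimiter option, which hands over each line as a single argument. The sh -c one-liner is only a stand-in for your_command, and the xargs_demo directory is a made-up name.

```shell
# Sketch: compare word-splitting and whole-line argument passing in GNU xargs.
# "xargs_demo" is a hypothetical directory name.
mkdir -p xargs_demo
printf '%s\n' 'one two' 'three' > xargs_demo/foo.txt

# Stand-in for your_command: report how many arguments it received.
show_args='echo "$# arg(s): $*"'

# One invocation per line; the words on the line become separate arguments.
xargs --max-lines=1 --arg-file=xargs_demo/foo.txt \
    sh -c "$show_args" sh > xargs_demo/by_words.out

# One invocation per line; the whole line is passed as a single argument.
xargs --delimiter='\n' --max-args=1 --arg-file=xargs_demo/foo.txt \
    sh -c "$show_args" sh > xargs_demo/by_lines.out

cat xargs_demo/by_words.out xargs_demo/by_lines.out
```

For the first input line, the first form reports 2 arg(s) and the second reports 1 arg(s). Which one you want depends on whether your command expects a whole line or separate fields.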

Upvotes: 4

Ross Presser

Reputation: 6255

split --lines=1 --suffix-length=5 input.txt output.

This will use 5 characters per suffix, which is enough for 26^5 = 11881376 files. If you really have more than that, increase suffix-length.
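
If you would rather work out the smallest suffix length for a given line count, the same arithmetic can be sketched as a small shell function. The name suffix_length_for is made up for this example and is not part of split; it assumes the default alphabetic suffixes, with 26 choices per position.

```shell
# Sketch: smallest suffix length whose 26^len capacity covers a line count.
# suffix_length_for is a hypothetical helper, not a split feature.
suffix_length_for() {
    lines=$1
    len=1
    capacity=26
    # Add one suffix character at a time until every line gets a name.
    while [ "$capacity" -lt "$lines" ]; do
        capacity=$((capacity * 26))
        len=$((len + 1))
    done
    echo "$len"
}

suffix_length_for 10000      # prints 3, since 26^3 = 17576
suffix_length_for 11881376   # prints 5, exactly 26^5
suffix_length_for 11881377   # prints 6
```

The result could then be passed to --suffix-length, for example with the line count taken from wc --lines.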

Upvotes: 2
