Reputation: 791
I'm trying to split a very large file into one new file per line.
Why? It's going to be input for Mahout, but there are too many lines and not enough suffixes for split.
Is there a way to do this in bash?
Upvotes: 3
Views: 1959
Reputation: 33685
GNU Parallel can do this:
cat big.file | parallel --pipe -N1 'cat > {#}'
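Here --pipe -N1 sends one record (one line, by default) to each job, and {#} expands to the job sequence number, so the output files are simply named 1, 2, 3, and so on. A quick sanity check (just a sketch; seq only generates test input):

seq 3 | parallel --pipe -N1 'cat > {#}'
ls
# 1  2  3
cat 2
# 2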
But if Mahout can read from stdin, then you can avoid the temporary files:
cat big.file | parallel --pipe -N1 mahout --input-file -
Learn more about GNU Parallel by watching the intro videos (https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1) and walking through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Upvotes: 1
Reputation: 246774
Here's another way to do something for each line:
while IFS= read -r line; do
    do_something_with "$line"
done < big.file
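For the question's actual goal of one file per line, do_something_with can simply write to a numbered file. A minimal sketch (the line_N.txt naming is my own choice, not from the question):

n=0
while IFS= read -r line; do
    n=$((n + 1))
    printf '%s\n' "$line" > "line_$n.txt"
done < big.file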
Upvotes: 1
Reputation: 84343
If you insist on using split, then you have to increase the suffix length: the default of 2 allows only 26^2 = 676 output files. For example, assuming you have 10,000 lines in your file (26^3 = 17,576 would already cover that, so 5 leaves ample headroom):
split --suffix-length=5 --lines=1 foo.txt
If you really want to go nuts with this approach, you can even set the suffix length dynamically with the wc command and some shell arithmetic. For example:
file='foo.txt'

# Suffix length = number of decimal digits in the line count:
# `wc --lines` prints the count plus a trailing newline, so subtract 1
# from the character count. Since 26^d > 10^d, d alphabetic suffix
# characters always cover a d-digit line count.
split \
    --suffix-length=$(( $(wc --chars < <(wc --lines < "$file")) - 1 )) \
    --lines=1 \
    "$file"
However, the above is really just a kludge. A better solution is to use xargs from the GNU findutils package to invoke some command once per line. For example:
xargs --max-lines=1 --arg-file=foo.txt your_command
This will pass one line at a time to your command. It is a much more flexible approach and avoids the disk I/O of creating thousands of tiny files.
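For example, with echo standing in for the real per-line command (a sketch; seq only generates test data):

seq 3 > foo.txt
xargs --max-lines=1 --arg-file=foo.txt echo processing
# processing 1
# processing 2
# processing 3

Note that xargs splits each line into words; if embedded whitespace or quotes matter, also look at GNU xargs' --delimiter option.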
Upvotes: 4
Reputation: 6255
split --lines=1 --suffix-length=5 input.txt output.
This will use 5 characters per suffix, which is enough for 26^5 = 11,881,376 files. If you really have more than that, increase --suffix-length.
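As a quick check of what the resulting names look like (a sketch using seq for test data):

seq 3 > input.txt
split --lines=1 --suffix-length=5 input.txt output.
ls output.*
# output.aaaaa  output.aaaab  output.aaaac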
Upvotes: 2