Reputation: 1291
How can I split a large CSV file (~100 GB) and preserve the header in each part?
For instance
h1 h2
a aa
b bb
into
h1 h2
a aa
and
h1 h2
b bb
Upvotes: 9
Views: 7114
Reputation: 1638
You can download a freeware tool, CsvSplitter, from here. The download is a zip from the website containing a simple portable .exe file and a .txt file that the executable needs; just extract both into a directory and you're ready to work:
It can split the file as can be seen in this picture.
Everything is self-explanatory, but more details can be found here.
Upvotes: 0
Reputation: 2866
None of the previous solutions worked properly on the Mac systems my script was targeting (why, Apple, why?), so I eventually ended up with a printf-based approach that worked out pretty well as a proof of concept. I plan to improve performance by putting the temporary files on a RAM disk, since the script currently writes a lot to disk and will probably be slow.
#!/bin/sh
# Pass the file as the first argument on the command line (note: not secure)
file=$1
# Extract the header line
header=$(head -n 1 "$file")
# Separate the data from the header
tail -n +2 "$file" > output.data
# Split the data into 1000 lines per file (change as you wish).
# The prefix is distinct from output.data so the glob below skips that file.
split -l 1000 output.data output_part_
# Prepend the header to each file produced by split, via a temporary file
for part in output_part_*
do
    { printf '%s\n' "$header"; cat "$part"; } > "$part.tmp" && mv "$part.tmp" "$part"
done
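If writing the intermediate output.data file is the bottleneck, a single pass with awk avoids it entirely. The following is only a minimal sketch, not a drop-in replacement for the script above: the sample file name sample.csv, the part_NNNN.csv naming, and the chunk size of 2 are made up for illustration, and it assumes that no quoted field contains an embedded newline.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: one header line plus three data lines.
printf 'h1 h2\na aa\nb bb\nc cc\n' > sample.csv

# lines = data lines per part (assumed value; use e.g. 1000 for real data).
awk -v lines=2 '
    NR == 1 { header = $0; next }            # remember the header line
    (NR - 2) % lines == 0 {                  # first data line of a new part
        if (out != "") close(out)            # keep only one output file open
        out = sprintf("part_%04d.csv", ++n)
        print header > out                   # each part starts with the header
    }
    { print > out }                          # then append the data line
' sample.csv
```

Because each part is written once, with its header already in place, nothing has to be rewritten afterwards; closing the previous part before opening the next keeps the number of open files constant however large the input is.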
Upvotes: 1
Reputation: 24802
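First you need to separate the header from the content:
header=$(head -n 1 "$file")
Then you want to split the data. Pipe it straight into split rather than storing it in a variable: 100 GB will not fit in a shell variable, and an unquoted echo $data would collapse all the newlines into spaces.
tail -n +2 "$file" | split [options...] -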
In the options you specify the size of the chunks; a prefix for the names of the resulting files can be given as an extra argument after the -. The - itself must not be removed, as it tells split to read its data from stdin.
Then you can insert the header at the top of each file (this one-line form of the i command and the bare -i are GNU sed syntax):
sed -i "1i$header" "$splitOutputFile"
You should obviously do that last part in a for loop, but its exact code will depend on the prefix chosen for the split operation.
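Putting these steps together, the loop could look roughly like the sketch below. The prefix chunk_, the sample file, and the chunk size of one line are assumptions chosen for the demo. Note that the header is prepended through a temporary file rather than with sed -i, so that the same script runs with both GNU and BSD tools.

```shell
#!/bin/sh
# Work in a scratch directory with a tiny hypothetical sample file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'h1 h2\na aa\nb bb\n' > sample.csv
file=sample.csv

# Keep the header line for later.
header=$(head -n 1 "$file")

# Stream the data lines into split: "-" means stdin, chunk_ is the assumed prefix.
tail -n +2 "$file" | split -l 1 - chunk_

# The glob is expanded once, so the .tmp files created below are not revisited.
for part in chunk_*
do
    { printf '%s\n' "$header"; cat "$part"; } > "$part.tmp" && mv "$part.tmp" "$part"
done
```

With the default suffixes this produces chunk_aa and chunk_ab, each starting with the header line.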
Upvotes: 5