echo

Reputation: 1291

Split large csv file and keep header in each part

How can I split a large CSV file (~100 GB) and preserve the header in each part?

For instance

h1 h2
a  aa
b  bb

into

h1 h2
a  aa

and

h1 h2
b  bb

Upvotes: 9

Views: 7114

Answers (3)

Shaina Raza

Reputation: 1638

You can download a freeware tool called CsvSplitter. It comes as a zip archive from the tool's website and contains a simple portable .exe file plus a .txt file that the executable needs. Extract both into the same directory and you're ready to work.

It can then split the file.

Everything is self-explanatory, but more details can be found on the tool's website.

Upvotes: 0

Josiah

Reputation: 2866

None of the previous solutions worked properly on the Mac systems my script was targeting (why, Apple, why?). I eventually ended up with a printf-based approach that worked out pretty well as a proof of concept. I plan to improve it by putting the temporary files on a RAM disk, since as written it puts a lot on disk and will probably be slow.

#!/bin/sh

# Pass a file in as the first argument on the command line (note, not secure)
file=$1

# Get the header line out
header=$(head -n 1 "$file")

# Separate the data from the header
tail -n +2 "$file" > output.data

# Split the data into 1000 lines per file (change as you wish).
# The distinct prefix keeps output.data itself out of the loop below.
split -l 1000 output.data output_part_

# Prepend the header to each file produced by split.
# The trailing \n restores the newline that $(...) strips.
for part in output_part_*
do
  printf '%s\n%s\n' "$header" "$(cat "$part")" > "$part"
done

# Remove the intermediate copy of the data
rm output.data
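The script above writes a full copy of the data to disk before splitting it, which hurts on a ~100 GB input. As a rough alternative sketch (not the approach used in this answer), awk can do the whole job in one pass with no intermediate file. The file name demo.csv, the piece_ prefix, and the 2-rows-per-part setting are made up for illustration. Like every line-based approach here, it assumes no quoted field contains an embedded newline.

```shell
#!/bin/sh
set -eu

# Tiny sample input so the sketch can be run as-is (hypothetical file name).
printf 'h1,h2\na,aa\nb,bb\nc,cc\n' > demo.csv

# lines  = data rows per part, prefix = output name stem (both illustrative).
awk -v lines=2 -v prefix=piece_ '
  NR == 1 { header = $0; next }          # remember the header row, emit nothing
  (NR - 2) % lines == 0 {                # first data row of a new part
    if (out != "") close(out)            # close the previous part to free its handle
    out = sprintf("%s%03d.csv", prefix, ++n)
    print header > out                   # each part starts with the header
  }
  { print > out }                        # every data row goes to the current part
' demo.csv
```

With the sample input this produces piece_001.csv (header plus rows a and b) and piece_002.csv (header plus row c).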

Upvotes: 1

Aaron

Reputation: 24802

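First you need to extract the header:

header=$(head -n 1 "$file")

Then you want to split the data. Stream it straight into split rather than storing it in a variable: a 100 GB body will not fit in one, and an unquoted echo would collapse its newlines into spaces.

tail -n +2 "$file" | split [options...] -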

In the options you have to specify the size of the chunks and the pattern for the names of the resulting files. The trailing - must not be removed, as it tells split to read its data from stdin.
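As an illustration of what those options might look like, here is a small sketch using only POSIX split flags. The file name opts_demo.csv, the rows_ prefix, and the chunk size of 2 are assumptions chosen for the demo. The -a flag is worth considering for very large inputs: with the POSIX default of two-letter suffixes only 676 output files are possible, and some split implementations stop with an error once those run out.

```shell
#!/bin/sh
set -eu

# Tiny sample input so the sketch can be run as-is (hypothetical file name).
printf 'h1,h2\na,aa\nb,bb\nc,cc\n' > opts_demo.csv

# -l 2   two lines per chunk (illustrative; pick what suits your data)
# -a 3   three-letter suffixes, allowing 26^3 = 17576 output files
# -      read the data from stdin
# rows_  hypothetical prefix for the output file names
tail -n +2 opts_demo.csv | split -l 2 -a 3 - rows_

# List the chunks that were created.
ls rows_*
```

Here the chunks are named rows_aaa and rows_aab, and none of them contains the header yet.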

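Then you can insert the header at the top of each file (this one-line form of sed -i relies on GNU sed; the BSD sed shipped with macOS needs a different syntax):

sed -i "1i$header" "$splitOutputFile"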

You should obviously do that last part in a for loop, but its exact code will depend on the prefix chosen for the split operation.
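Putting the steps together, the loop could look roughly like the sketch below. It assumes a hypothetical chunk_ prefix, a sample file named sample.csv, and 2 lines per chunk. It also swaps the sed -i step for a temporary file plus mv, so that it does not depend on GNU sed.

```shell
#!/bin/sh
set -eu

# Tiny sample input so the sketch can be run as-is (hypothetical file name).
printf 'h1,h2\na,aa\nb,bb\nc,cc\n' > sample.csv

file=sample.csv
header=$(head -n 1 "$file")

# Stream everything after the header into split; chunk_ is an assumed prefix.
tail -n +2 "$file" | split -l 2 - chunk_

# The glob is expanded once, before the loop runs, so the .tmp files
# created below are never picked up as chunks themselves.
for part in chunk_*
do
  # Write header + chunk to a temporary file, then replace the chunk with it.
  { printf '%s\n' "$header"; cat "$part"; } > "$part.tmp"
  mv "$part.tmp" "$part"
done
```

The glob in the for loop has to match whatever prefix was passed to split, and only the files split produced, which is why a prefix that nothing else in the directory shares is the safer choice.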

Upvotes: 5
