zio

Reputation: 2225

Quickest way to split a large file based on text within the file in linux

I have a large file which contains data for 10 years. I want to split it into files that contain 1 year of data each.

The data in the file is in the following format:

GBPUSD,20100201,000200,1.5969,1.5969,1.5967,1.5967,4
GBPUSD,20100201,000300,1.5967,1.5967,1.5960,1.5962,4

Characters 8-11 contain the year. I would like to use that as the filename, with .txt on the end, so 2011.txt, 2012.txt, etc.

The file contains around 4 million rows.

I'm using Ubuntu Linux.

Upvotes: 6

Views: 2141

Answers (3)

steveha

Reputation: 76695

It would be best to read through the file once, and write each line to the file where it should go. So the solution by @steve using AWK is a good one.

You could solve this problem using grep and an appropriate regular expression: ^.......2010 would only match lines that have 2010 in the year position. Then a shell script could loop over the years and keep running grep, something like this:

for year in 2010 2011 2012; do
    grep "^.......$year" datafile > $year.txt
done

But it's not elegant because it reads the whole source file once per year.

Here's a Python solution to go along with the AWK one.

import sys

def next_line():
    """Yield lines from each file named on the command line, or from stdin."""
    if len(sys.argv) == 1:
        for line in sys.stdin:
            yield line
    else:
        for name in sys.argv[1:]:
            with open(name) as f:
                for line in f:
                    yield line


_open_files = {}
def output(fname, line):
    """Write line to fname, opening the file the first time it is needed."""
    if fname not in _open_files:
        _open_files[fname] = open(fname, "w")
    _open_files[fname].write(line)


for line in next_line():
    year = line[7:11]  # characters 8-11 of the record
    fname = year + ".txt"
    output(fname, line)
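
If you save this as, say, split_by_year.py (the name is just illustrative), you can run it as python split_by_year.py datafile, or pipe the data in on standard input; 2010.txt, 2011.txt and so on are created in the current directory.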

AWK certainly wins for brevity. I had to write next_line() to yield source lines from each input file in turn (or from standard input if you didn't specify a file), and output() to let you just hand it a filename and a string to write; with AWK you get both of those for free.

If your problem will not ever get more complicated you could use the AWK solution, but if you expect to add more bells and whistles as time goes by, the Python solution might pay off. (That's why I love Python... once you have it working, it's easy to extend it no matter what you need to do.)
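
For example, suppose you later wanted to split by the instrument in the first field as well (a hypothetical requirement; the names below are illustrative), so the output files become GBPUSD-2010.txt and so on. Only the filename logic in the main loop needs to change:

for line in next_line():
    fields = line.split(",")                            # e.g. GBPUSD,20100201,000200,...
    fname = fields[0] + "-" + fields[1][:4] + ".txt"    # instrument-year.txt
    output(fname, line)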

Upvotes: 0

twm

Reputation: 1458

I think this should work from the command line. It first pulls the distinct years out of the file with sed, then runs egrep over the file once per year:

YEARS=$(sed -e 's/^.......//' -e 's/\(....\).*$/\1/' FILE | sort -u)
for Y in $YEARS ; do
    echo Processing $Y...
    egrep "^.......$Y" FILE > $Y.txt
done

Upvotes: 0

Steve

Reputation: 54392

Here's one way using awk:

awk '{ print > (substr($0,8,4) ".txt") }' file

If the length of the first field can vary, you may prefer:

awk -F, '{ print > (substr($2,1,4) ".txt") }' file
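
Either way, a quick sanity check after the split is that the per-year files add up to the original, e.g. compare the total from wc -l 20??.txt with wc -l file.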

Upvotes: 7
