asdf

Reputation: 2441

Search and sort data from several files

I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so on. Each file contains millions of rows, and each row has 7 columns, like:

ccc245 1 4 5 5 3 -12.3

The most important values for me are those in the first and seventh columns, i.e. pairs like ccc245, -12.3.

What I need to do is find, across all the in_sXXXX.txt files, the 10 rows with the lowest values in the seventh column, and I also need to know where each value is located, i.e. in which file. I need something like:

FILE  1st_col  7th_col

in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6

I was thinking about using Python and bash for this, but so far I have not found a practical approach. All I know how to do is:

  1. concatenate all the in_ files into IN.TXT
  2. search for the lowest values there using: for i in IN.TXT ; do sort -k7n $i | head -n 10; done
  3. given the 1st_col and 7th_col values from the top-ten list, use them to filter the original in_s files with grep -n VALUE in_s*, so that for each value I get the name of the file it came from

It works, but it is a bit tedious. I wonder whether there is a faster approach using only bash or Python, or both, or some other language better suited for this.
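
To make the goal concrete, the naive Python version I have in mind would be roughly the following (untested; I expect that collecting and sorting everything will not cope well with millions of rows per file, which is why I am looking for a better approach):

import glob

# collect (7th-column value, filename, 1st column) for every row in every file
rows = []
for fname in glob.glob("in_s*.txt"):
    f = open(fname)
    for line in f:
        cols = line.split()
        rows.append((float(cols[6]), fname, cols[0]))
    f.close()

# sort by the value and keep the ten smallest
rows.sort()
for value, fname, first_col in rows[:10]:
    print fname, first_col, value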

Thanks

Upvotes: 2

Views: 1535

Answers (6)

John Machin

Reputation: 83002

In Python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.

Example (tested) for Python 2.5 and 2.6:

import heapq, glob

def my_iterable():
    # yield (filename, first column, seventh column as a float) for every row
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

# the 10 tuples with the smallest seventh-column value
result = heapq.nsmallest(10, my_iterable(), key=lambda x: x[2])
print result

Update after above answer accepted

Looking at the source code for Python 2.6, it appears that there's a possibility that nsmallest does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the heap to n (n == 10 in your case).

Note: 2.6 only; if you need it for 2.5, use a conditional heapreplace() as explained in the docs. This uses heappush() and heappushpop(), which don't have the key arg :-( so we have to fake it by putting the (negated) sort value first in each tuple.

import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    # yield (-seventh column, filename, first column); negating the value
    # turns the "n largest" heap below into "n smallest" on the original values
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            # push the new item, then pop the smallest, so only the n largest remain
            heappushpop(heap, item)
    return heap

result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
# undo the negation and reorder as (filename, first column, seventh column)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)

Upvotes: 3

ghostdog74

Reputation: 342719

If your files are millions of lines each, you might want to consider using "buffering". The script below goes through those millions of lines, each time comparing field 7 against the values in the buffer. If a value is smaller than the largest one in the buffer, that buffer entry is replaced by the new, lower value.

for file in in_*.txt
do
    awk 'NR<=10{
        # fill the buffer with the first 10 rows
        c=c+1
        val[c]=$7
        tag[c]=$1
    }
    NR>10{
        # find the largest value currently in the buffer ...
        m=1
        for(o=2;o<=c;o++){
            if( val[o] > val[m] ) m=o
        }
        # ... and replace it if the current row is lower
        if( $7 < val[m] ){
            val[m]=$7
            tag[m]=$1
        }
    }
    END{
        for(i=1;i<=c;i++){
            print val[i], tag[i] | "sort -n"
        }
    }' "$file"
done

Upvotes: 0

Sam Doshi

Reputation: 4376

Try something like this in Python:

import os

min_values = []

def add_to_min(file_name, one, seven):
    # keep the rows with the 10 lowest 7th-column values seen so far
    if len(min_values) < 10:
        min_values.append((seven, file_name, one))
    elif seven < max(min_values)[0]:
        # remove the biggest value
        min_values.sort()
        min_values.pop()
        # and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))
    f.close()

# print answers
for (seven, file_name, one) in sorted(min_values):
    print file_name, one, seven

Haven't tested it, but it should get you started.

Version 2, just runs the sort a single time (after a prod by S. Lott):

values = []
# loop through all the files and make a long list of all the rows,
# with the 7th-column value first so that sorting puts the smallest rows first
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))
    f.close()

# sort values, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven

Just re-read your question: with millions of rows, you might run out of RAM....

Upvotes: 1

Dennis Williamson

Reputation: 360365

This might be close to what you're looking for:

for file in in_s*.txt; do sort -k7n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done

cat *.out | sort -k3n | head -n 10 > final_result.out

Upvotes: 0

lutz

Reputation:

A small improvement on your shell solution:

$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
  cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8

Upvotes: 0

Antony Hatchkins

Reputation: 34014

I would:

  • take the first 10 items,
  • sort them, and then
  • for every line read from the files, insert the element into that top 10:
    • only if its value is lower than the highest one currently in the top 10,
    • keeping the list sorted, for performance.

I wouldn't post the complete program here as it looks like homework.

Yes, if it weren't ten, this would not be optimal.
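
That said, a minimal sketch of just the insertion step, using the bisect module to keep the list sorted (untested; the names are only illustrative, not the complete program):

import bisect

top10 = []  # kept sorted ascending by the 7th-column value

def insert_row(value, fname, first_col, n=10):
    # skip quickly if the value cannot make it into the current top n
    if len(top10) == n and value >= top10[-1][0]:
        return
    bisect.insort(top10, (value, fname, first_col))
    if len(top10) > n:
        top10.pop()  # drop the current largest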

Upvotes: 2
