sflk

Reputation: 1

Sorting huge files with millions of lines

I have tens of millions of strings in text file like these:

aa kk
bb mm
cc tt
ee ff
aa xx
bb ss
cc gg
ee rr

And I want to make them look like:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

I have tried to sort and rearrange it with grep, sed, and other tools, but that approach turns out to be very slow on really huge files, even with

LC_ALL=C grep something

Upvotes: 0

Views: 1796

Answers (4)

NeronLeVelu

Reputation: 10039

For performance while staying conservative on memory:

sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'

The sort first removes duplicates and arranges the lines in order, which allows the awk to read line by line instead of loading a huge array (important given the millions of lines you mention). The awk concatenates the second field while the key is the same as on the previous line, and prints the group when the key changes. The END block prints the last group, and the extra if skips the print on the very first line.

Maybe a bit faster:

sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
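For comparison, here is the same streaming idea written in Python (a sketch of my own, not part of the original answer): because the input arrives pre-sorted from sort -u, itertools.groupby can collect each run of consecutive lines sharing a key, so memory use stays constant no matter how big the file is.

#!/usr/bin/python3
# Usage: sort -u YourFile | ./group.py
# Assumes every input line has exactly two whitespace-separated fields.
import sys
from itertools import groupby

def key_of(line):
    return line.split()[0]

# groupby only groups *consecutive* lines, which is why the input
# must be sorted first -- the same reason the awk version needs sort.
for key, lines in groupby(sys.stdin, key=key_of):
    print(key, ",".join(line.split()[1] for line in lines))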

Upvotes: 1

Cyrus

Reputation: 88776

awk '{if(b[$1])b[$1] = b[$1]","; b[$1] = b[$1] $2}; END{for(i in b)print i, b[i]}' file

Note that for (i in b) visits the keys in no guaranteed order; pipe the result through sort if you need the output sorted by key.

Output:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

Source: https://stackoverflow.com/a/26450166/3776858

Upvotes: 1

Jay Kominek

Reputation: 8783

I'm not clear on whether you specifically want to do this with just standard shell tools, but Python is nearly universal on Linux these days, and this can be done with a fairly simple program:

#!/usr/bin/python3

import sys

# Collect the values for each key in a dict of lists.
data = {}
for line in sys.stdin:
    a, b = line.split()
    data.setdefault(a, []).append(b)

# Print the keys in sorted order, joining each key's values with commas.
for k in sorted(data):
    print(k, ",".join(data[k]))

I ran it on 50,000,000 lines of data generated by the following C program, and it finishes in about 60 seconds on my years-old laptop:

#include <stdio.h>
#include <stdlib.h>

/* Return a random lowercase letter ('a' to 'z'). */
char letter() { return (rand() % (123-97)) + 97; }

int main(void)
{
  int i;
  for(i=0; i<50000000; i++)
    printf("%c%c%c %c%c%c\n",
           letter(), letter(), letter(),
           letter(), letter(), letter());
  return 0;
}

Upvotes: 1

Ashouri

Reputation: 906

If you have to deal with very large data sets, I suggest you use the MapReduce pattern, for example the Hadoop framework or Spark. Take a look at https://hadoop.apache.org
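To make that concrete, here is a minimal PySpark sketch (the framework choice, application name, and HDFS paths are hypothetical; the question does not describe a cluster setup). reduceByKey performs the grouping step in parallel across the cluster:

# A sketch, assuming PySpark is installed and the data lives on HDFS.
from pyspark import SparkContext

sc = SparkContext(appName="GroupPairs")

pairs = (sc.textFile("hdfs:///path/to/input")        # one "key value" pair per line
           .map(lambda line: tuple(line.split()))    # ("aa", "kk")
           .reduceByKey(lambda a, b: a + "," + b))   # "kk" + "xx" -> "kk,xx"

pairs.map(lambda kv: kv[0] + " " + kv[1]).saveAsTextFile("hdfs:///path/to/output")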

Upvotes: 0
