Reputation: 1
I have tens of millions of strings in a text file, like these:
aa kk
bb mm
cc tt
ee ff
aa xx
bb ss
cc gg
ee rr
And I want to make them look like:
aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr
I have tried to sort and rearrange it with grep, sed, and other tools, but it seems to be a very slow approach on really huge files, even with
LC_ALL=C grep something
Upvotes: 0
Views: 1796
Reputation: 10039
For performance while staying conservative on memory:
sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
The sort first deduplicates the input and puts it in order, which lets the awk read line by line instead of loading a huge array (important given the millions of lines you mention). The awk concatenates while the first field is the same as on the previous line and prints when it changes. The END block handles the last group, and the extra if handles the first line.
Maybe a bit faster:
sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
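As a quick sanity check, the sample file can be recreated and fed through either pipeline above:

printf 'aa kk\nbb mm\ncc tt\nee ff\naa xx\nbb ss\ncc gg\nee rr\n' > YourFile
sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'

The output is:

aa kk,xx
bb mm,ss
cc gg,tt
ee ff,rr

Note that sort -u also orders the joined values within each group (and drops duplicate pairs), which is why cc comes out as gg,tt rather than the tt,gg shown in the question.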
Upvotes: 1
Reputation: 88776
awk '{if(b[$1])b[$1] = b[$1]","; b[$1] = b[$1] $2 $3}; END{for(i in b)print i, b[i]}' file
Output:
aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr
Source: https://stackoverflow.com/a/26450166/3776858
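Note that awk's for (i in b) loop does not guarantee any particular output order; if the result needs to be sorted by the first field, the output can be piped through sort:

awk '{if(b[$1])b[$1] = b[$1]","; b[$1] = b[$1] $2 $3}; END{for(i in b)print i, b[i]}' file | sort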
Upvotes: 1
Reputation: 8783
I'm not clear on whether you specifically want to do this with just standard shell tools or not, but Python is nearly universal on Linux these days. It can be done with a fairly simple program:
#!/usr/bin/python
import sys

# Collect every second field under its first field.
data = { }
while True:
    l = sys.stdin.readline()
    if len(l) == 0:
        break
    a, b = l.split()
    data.setdefault(a, [ ]).append(b)

# Print the keys in sorted order, with their values comma-joined.
for k in sorted(data.keys()):
    vs = data[k]
    print k, ",".join(vs)
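The script reads from standard input and writes to standard output. A typical invocation might look like the following (the file names are placeholders, and a Python 2 interpreter is assumed since the script uses the print statement):

python pairs.py < input.txt > output.txt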
I ran it on 50,000,000 lines of data generated by the following C program, and it finishes in about 60 seconds on my years-old laptop:
#include <stdio.h>
#include <stdlib.h>

/* Random lowercase letter, 'a' (97) to 'z' (122). */
char letter() { return (rand() % (123-97)) + 97; }

int main(void)
{
    int i;
    for (i = 0; i < 50000000; i++)
        printf("%c%c%c %c%c%c\n",
               letter(), letter(), letter(),
               letter(), letter(), letter());
    return 0;
}
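To reproduce the timing test, the generator can be compiled and piped straight into the Python script above (the file names here are placeholders):

cc -O2 -o gen gen.c
./gen | python pairs.py > output.txt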
Upvotes: 1
Reputation: 906
If you have to deal with very large data sets, I suggest you use the MapReduce pattern, for example with the Hadoop framework or Spark. Take a look at https://hadoop.apache.org
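As a rough sketch of how the grouping maps onto Hadoop Streaming (the jar location, HDFS paths and script names below are assumptions, and a real job needs a configured cluster): the mapper turns each line into a tab-separated key/value pair, the framework sorts and partitions by key, and the reducer joins the values per key, much like the sort | awk pipelines above.

map.sh:
#!/bin/sh
# Mapper: turn "key value" into "key<TAB>value" so Hadoop partitions and sorts by key.
tr ' ' '\t'

group.sh:
#!/bin/sh
# Reducer: lines arrive grouped by key; join the values of each key with commas.
awk -F'\t' '$1 == last {out = out "," $2; next}
            last != "" {print last " " out}
            {last = $1; out = $2}
            END {if (last != "") print last " " out}'

chmod +x map.sh group.sh
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files map.sh,group.sh \
    -input /data/pairs \
    -output /data/grouped \
    -mapper map.sh \
    -reducer group.sh

The grouped lines end up in the part-* files under /data/grouped, one file per reducer.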
Upvotes: 0