Joshua1729
Joshua1729

Reputation: 795

Join lines based on a starting value using UNIX commands

Here I am again, with another UNIX requirement (as my knowledge in UNIX is limited to basic commands).

I have a file that looks like this (and has about 30 million lines)

123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0

The final output should be like this (without the first value repeating in the joined portions)

123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0

However, if the above output is a bit complicated, an output like below is also fine. Because I can load the file into Oracle11g and get rid of the redundant columns.

123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0

Upvotes: 0

Views: 160

Answers (5)

beroe
beroe

Reputation: 12326

Here are some Python options, if you decide to go that route... First will work for multiple input files and non-sequential identical indices. Second doesn't read the whole file into memory.

(Note, I know it is not convention, but I intentionally use UpperCase for variables to make it clear what is a user-defined variable and what is a special python word.)

#! /usr/bin/env python
# -*- coding: utf-8 -*-

"""
concatenate comma-separated values based on first value

Usage: 
   catfile.py *.txt > output.dat

"""
import sys

if len(sys.argv)<2:
    sys.stderr.write(__doc__)
else:
    FileList = sys.argv[1:]
    IndexList = []
    OutDict = {}
    for FileName in FileList:
        with open(FileName,'rU') as FStream:
            for Line in FStream:
                if Line:
                    Ind,TheRest = Line.rstrip().split(",",1)
                    if Ind not in IndexList:
                        IndexList.append(Ind)
                    OutDict[Ind] = OutDict.get(Ind,"") + "," + TheRest

    for Ind in IndexList:
        print Ind + OutDict[Ind]

Here is a different version which doesn't load the whole file into memory, but requires that the identical Indices all occur in order, and it only runs on one file:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
concatenate comma-separated values based on first value

Usage: 
   catfile.py *.txt > output.dat

"""
import sys

if len(sys.argv)<2:
    sys.stderr.write(__doc__)
else:
    FileName = sys.argv[1]
    OutString = ''
    PrevInd = ''
    FirstLine = True
    with open(FileName,'rU') as FStream:
        for Line in FStream:
            if "," in Line:
                Ind,TheRest = Line.rstrip().split(",",1)
                if Ind != PrevInd:
                    if not FirstLine:
                        print PrevInd+OutString
                    PrevInd = Ind
                    OutString = TheRest
                    FirstLine = False
                else:
                    OutString += ","+TheRest
        print Ind + OutString

More generally, you can run these with by saving them as say catfile.py and then doing python catfile.py inputfile.txt > outputfile.txt. Or for longer term solutions, make a scripts directory, add it to your $PATH, make them executable with chmod u+x catfile.py and then you can just type the name of the script from any directory. But that is another topic that you would want to research.

Upvotes: 1

potong
potong

Reputation: 58578

This might work for you (GNU sed):

sort file | sed -r ':a;$!N;s/^(([^,]*),.*)\n\2/\1/;ta;P;D'

Sort the file (if need be) and then delete newline and key where duplicates appear.

Upvotes: 0

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89639

A way without array:

BEGIN { FS = OFS = "," ; ORS = "" }
{
    if (lid == $1) { $1 = "" ; print $0 }
    else { print sep $0 ; lid = $1 ; sep = "\n" }
}
END { if (NR) print }

Note: if you don't need a newline at the end, remove the END block.

Upvotes: 0

Jonathan Leffler
Jonathan Leffler

Reputation: 755094

Using awk is sufficient; it is a control-break report of sorts. Since the lines with the same key are grouped together — a very important point — it is fairly simple.

awk -F, '{   if ($1 != saved)
             {
                 if (saved != 0) print saved "," list
                 saved = $1
                 list = ""
             }
             pad = ""
             for (i = 2; i <= NF; i++) { list = list pad $i; pad = "," }
         }
         END { if (saved != 0) print saved, list }'

You can feed the data as standard input or list the files to be processed after the final single quote.

Sample output:

123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456 PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0

The code uses saved to keep a track of the key column value that it is accumulating. When the key column changes, print out the saved values (if there are any) and reset for the new set of lines. At the end, print out the saved values (if there are any). The code deals with an empty file gracefully, therefore.


Perl options

#!/usr/bin/env perl
use strict;
use warnings;
my $saved = "";
my $list;
while (<>)
{
    chomp;
    my($key,$value) = ($_ =~ m/^([^,]+)(,.*)/);
    if ($key ne $saved)
    {
        print "$saved$list\n" if $saved;
        $saved = $key;
        $list = "";
    }
    $list .= $value;
}
print "$saved$list\n" if $saved;

Or, if you really want to, you can saved writing the loop (and using strict and warnings) with:

perl -n -e 'chomp;
($key,$value) = ($_ =~ m/^([^,]+)(,.*)/);
if ($key ne $saved)
{
    print "$saved$list\n" if $saved;
    $saved = $key;
    $list = "";
}
$list .= $value;
} END {
print "$saved$list\n" if $saved;'

That could be squished down to a single (rather long) line. The } END { is a piece of Perl weirdness; the -n option creates a loop while (<>) { … } and interpolates the script in the -e argument into it, so the } in } END { terminates that loop and then creates an END block which is ended by the } that Perl provided. Yes, documented and supported; yes, extremely weird (so I wouldn't do it; I'd use the Perl script shown first).

Upvotes: 2

Tom Fenech
Tom Fenech

Reputation: 74705

This awk script does what you want:

BEGIN { FS = OFS = "," }
NR == 1 { a[++n] = $1 }
a[1] != $1 { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS); n = 1 }
{ a[1] = $1; for(i=2;i<=NF;++i) a[++n] = $i }
END { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS) }

It stores all of the fields with the same first column in an array. When the first column differs, it prints out all of the elements of the array. Use it like awk -f join.awk file.

Output:

123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0

Upvotes: 2

Related Questions