Pol

Reputation: 9

Remove entire rows from a file when a column value is repeated, but keep them in another file

I'm trying to do something like this

Input file

123  09
123  10
355  07
765  01
765  03
765  05

Output file 1

123 09
355 07
765 01

Output file 2

123 10
765 03
765 05

That is, I want to eliminate the entire row if the value in column 1 is repeated, but I actually want to put those rows in another file.

I know that I can obtain output 1 with

awk '!a[$1]++' file 

But is it possible to obtain output 2?

I'm open to python scripts.

Upvotes: 0

Views: 116

Answers (6)

alec_djinn

Reputation: 10799

This is an easy and readable Python script that will do the job. If you have any questions, please comment.

# open all the files
with open('output_1.txt','w') as out_1:
    with open('output_2.txt', 'w') as out_2:
        with open('input.txt', 'r') as f:
            #make list that stores intermediate results
            tmp = []
            #iterate over each row of the input file
            for row in f:
                #extract the data contained in the row
                col_1, col_2 = row.split('  ') #split the line at double space

                #check if you have met col_1 before
                #if not, write the row in output_1
                if col_1 not in tmp:
                    tmp.append(col_1)
                    out_1.write(row)
                #otherwise write the row in output_2
                else:
                    out_2.write(row)
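
A couple of notes on the snippet above: row.split('  ') assumes the two columns are always separated by exactly two spaces; if the separator can vary, row.split() (which splits on any run of whitespace) is more forgiving. Also, the membership test on the tmp list is linear in the number of distinct keys seen so far, so for very large inputs a set (as used in the last answer on this page) is the faster choice.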

Upvotes: 1

John Bollinger

Reputation: 180418

You can do this job directly in bash. For example:

#!/bin/bash

# file names
file=input.in
dupes=dupes.out
uniques=uniques.out

# an (associative) array to track seen keys
declare -A keys

# extracts a key from an input line via shell word splitting
get_key() {
  key=$1
}

# Removes old output files
[ -e "$dupes" ] && rm "$dupes"
[ -e "$uniques" ] && rm "$uniques"

# process the input line by line
while IFS= read -r line; do
  get_key $line
  if [ -n "${keys[$key]}" ]; then
    # a duplicate
    echo "$line" >> "$dupes"
  else
    # not a duplicate
    keys[$key]=1
    echo "$line" >> "$uniques"
  fi  
done < "$file"

It could be shortened in a variety of ways; I wrote it for clarity, and a bit for flexibility, at the expense of brevity.

In any case, it's important to understand that bash is a pretty powerful programming environment in its own right. One of the things that slows down many shell scripts is heavy use of external commands. Using external commands is not inherently bad, and sometimes it's the best or only way to get the job done, but where that's not the case, you should give serious consideration to avoiding them.

Upvotes: -1

123

Reputation: 11216

One way with awk

awk '{print >("file"(!a[$1]++?1:2))}' file

or

awk '{print >("file"(a[$1]++?2:1))}' file

Upvotes: 3

Jose Ricardo Bustos M.

Reputation: 8164

try

awk '{if($1 in a){ print > "Output2" }else{ print > "Output1"} a[$1]=1}' input

you get this in the Output1 file:

123  09
355  07
765  01

and this in the Output2 file:

123  10
765  03
765  05

If you only want output 2, remove the ! from your code:

awk 'a[$1]++' input
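
This prints only the rows whose first column has already been seen at least once (the complement of !a[$1]++); note that in this variant the first occurrence of each key is not written anywhere.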

Upvotes: 0

anubhava

Reputation: 785286

For both the 1st and the 2nd outputs you can use this awk command:

awk '!seen[$1]++{print > "output1"; next} {print > "output2"}' file

cat output1
123  09
355  07
765  01

cat output2
123  10
765  03
765  05

Upvotes: 0

Mike Müller

Reputation: 85482

With Python:

seen = set()
with open('data.txt') as fin, open('f1.txt', 'w') as fout1, open('f2.txt', 'w') as fout2:
    for line in fin:
        col = line.split()[0]
        if col in seen:
            fout2.write(line)
        else:
            seen.add(col)
            fout1.write(line)
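
If you want to sanity-check it, here is a self-contained sketch that recreates the sample input from the question, applies the same logic, and prints both results (the file names are only placeholders for this demo):

# Demo only: write the sample input from the question, then split it.
sample = (
    "123  09\n"
    "123  10\n"
    "355  07\n"
    "765  01\n"
    "765  03\n"
    "765  05\n"
)

with open('data.txt', 'w') as f:
    f.write(sample)

seen = set()
with open('data.txt') as fin, open('f1.txt', 'w') as fout1, open('f2.txt', 'w') as fout2:
    for line in fin:
        col = line.split()[0]      # first column decides where the row goes
        if col in seen:
            fout2.write(line)      # repeated key -> output 2
        else:
            seen.add(col)
            fout1.write(line)      # first occurrence -> output 1

with open('f1.txt') as f:
    print(f.read())                # 123  09, 355  07, 765  01
with open('f2.txt') as f:
    print(f.read())                # 123  10, 765  03, 765  05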

Upvotes: 0
