Pol

Reputation: 9

Remove entire rows from a file when a column value is repeated, but keep them in another file

I'm trying to do something like this

Input file

123  09
123  10
355  07
765  01
765  03
765  05

Output file 1

123 09
355 07
765 01

Output file 2

123 10
765 03
765 05

That is, I want to eliminate the entire row if the value in column 1 is repeated, but I actually want to put those rows in another file.

I know that I can obtain output 1 with

awk '!a[$1]++' file 

But is it possible to obtain output 2?

I'm open to python scripts.

Upvotes: 0

Views: 116

Answers (6)

alec_djinn

Reputation: 10799

This is an easy and readable Python script that will do the job. If you have any questions, please comment.

# open all the files
with open('output_1.txt','w') as out_1:
    with open('output_2.txt', 'w') as out_2:
        with open('input.txt', 'r') as f:
            #make list that stores intermediate results
            tmp = []
            #iterate over each row of the input file
            for row in f:
                #extract the data contained in the row
                col_1, col_2 = row.split('  ') #split the line at double space

                #check if you have met col_1 before
                #if not, write the row in output_1
                if col_1 not in tmp:
                    tmp.append(col_1)
                    out_1.write(row)
                #otherwise write the row in output_2
                else:
                    out_2.write(row)
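
A couple of notes on the snippet above: row.split('  ') assumes the two columns are always separated by exactly two spaces; if the separator can vary, row.split() (which splits on any run of whitespace) is more forgiving. Also, the membership test on the tmp list is linear in the number of distinct keys seen so far, so for very large inputs a set (as used in the last answer on this page) is the faster choice.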

Upvotes: 1

John Bollinger

Reputation: 180418

You can do this job directly in bash. For example:

#!/bin/bash

# file names
file=input.in
dupes=dupes.out
uniques=uniques.out

# an (associative) array to track seen keys
declare -A keys

# extracts a key from an input line via shell word splitting
get_key() {
  key=$1
}

# Removes old output files
[ -e "$dupes" ] && rm "$dupes"
[ -e "$uniques" ] && rm "$uniques"

# process the input line by line
while IFS= read -r line; do
  get_key $line
  if [ -n "${keys[$key]}" ]; then
    # a duplicate
    echo "$line" >> "$dupes"
  else
    # not a duplicate
    keys[$key]=1
    echo "$line" >> "$uniques"
  fi  
done < "$file"

It could be shortened in a variety of ways; I wrote it for clarity, and a bit for flexibility, at the expense of brevity.

In any case, it's important to understand that bash is a pretty powerful programming environment in its own right. One of the things that slows down many shell scripts is heavy use of external commands. Using external commands is not inherently bad, and sometimes it's the best or only way to get the job done, but where that's not the case, you should give serious consideration to avoiding them.

Upvotes: -1

123

Reputation: 11216

One way with awk

awk '{print >("file"(!a[$1]++?1:2))}' file

or

awk '{print >("file"(a[$1]++?2:1))}' file

Upvotes: 3

Jose Ricardo Bustos M.

Reputation: 8164

try

awk '{if($1 in a){ print > "Output2" }else{ print > "Output1"} a[$1]=1}' input

you get this in the Output1 file:

123  09
355  07
765  01

and this in the Output2 file:

123  10
765  03
765  05

If you only want output 2, remove the ! from your code:

awk 'a[$1]++' input
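
This prints only the rows whose first column has already been seen at least once (the complement of !a[$1]++); note that in this variant the first occurrence of each key is not written anywhere.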

Upvotes: 0

anubhava

Reputation: 785286

For both the 1st and the 2nd outputs you can use this awk command:

awk '!seen[$1]++{print > "output1"; next} {print > "output2"}' file

cat output1
123  09
355  07
765  01

cat output2
123  10
765  03
765  05

Upvotes: 0

Mike Müller

Reputation: 85482

With Python:

seen = set()
with open('data.txt') as fin, open('f1.txt', 'w') as fout1, open('f2.txt', 'w') as fout2:
    for line in fin:
        col = line.split()[0]
        if col in seen:
            fout2.write(line)
        else:
            seen.add(col)
            fout1.write(line)
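
If you want to sanity-check it, here is a self-contained sketch that recreates the sample input from the question, applies the same logic, and prints both results (the file names are only placeholders for this demo):

# Demo only: write the sample input from the question, then split it.
sample = (
    "123  09\n"
    "123  10\n"
    "355  07\n"
    "765  01\n"
    "765  03\n"
    "765  05\n"
)

with open('data.txt', 'w') as f:
    f.write(sample)

seen = set()
with open('data.txt') as fin, open('f1.txt', 'w') as fout1, open('f2.txt', 'w') as fout2:
    for line in fin:
        col = line.split()[0]      # first column decides where the row goes
        if col in seen:
            fout2.write(line)      # repeated key -> output 2
        else:
            seen.add(col)
            fout1.write(line)      # first occurrence -> output 1

with open('f1.txt') as f:
    print(f.read())                # 123  09, 355  07, 765  01
with open('f2.txt') as f:
    print(f.read())                # 123  10, 765  03, 765  05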

Upvotes: 0
