Reputation: 9
I'm trying to do something like this
Input file
123 09
123 10
355 07
765 01
765 03
765 05
Output file 1
123 09
355 07
765 01
Output file 2
123 10
765 03
765 05
I mean, I want to eliminate the entire row if there are repeated values in column 1, but I actually want to put those rows in another file.
I know that I can obtain output 1 with
awk '!a[$1]++' file
But is it possible to obtain output 2?
I'm open to python scripts.
Upvotes: 0
Views: 116
Reputation: 10799
This is an easy and readable Python script that will do the job. If you have any questions, please comment.
# open all the files
with open('output_1.txt', 'w') as out_1:
    with open('output_2.txt', 'w') as out_2:
        with open('input.txt', 'r') as f:
            # make a list that stores the keys seen so far
            tmp = []
            # iterate over each row of the input file
            for row in f:
                # extract the data contained in the row
                col_1, col_2 = row.split(' ')  # split the line at the single space
                # check if you have met col_1 before;
                # if not, write the row to output_1
                if col_1 not in tmp:
                    tmp.append(col_1)
                    out_1.write(row)
                # otherwise write the row to output_2
                else:
                    out_2.write(row)
Upvotes: 1
Reputation: 180418
You can do this job directly in bash. For example:
#!/bin/bash

# file names
file=input.in
dupes=dupes.out
uniques=uniques.out

# an associative array to track seen keys
declare -A keys

# extracts a key from an input line via shell word splitting
get_key() {
    key=$1
}

# remove old output files
[ -e "$dupes" ] && rm "$dupes"
[ -e "$uniques" ] && rm "$uniques"

# process the input line by line
while read -r line; do
    get_key $line  # unquoted on purpose: word splitting yields the fields
    if [ -n "${keys[$key]}" ]; then
        # a duplicate
        echo "$line" >> "$dupes"
    else
        # not a duplicate
        keys[$key]=1
        echo "$line" >> "$uniques"
    fi
done < "$file"
It could be shortened in a variety of ways; I wrote it for clarity and a bit of flexibility at the expense of brevity.
It's important to understand, anyway, that bash is a pretty powerful programming environment in its own right. One of the things that slows down many shell scripts is the use of a lot of external commands. Using external commands is not inherently bad, and sometimes it's the best or only way to get the job done, but where that's not the case you should give serious consideration to avoiding them.
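To illustrate that point, here is a minimal sketch (variable names are mine, not from the script above) showing a key extracted with a builtin parameter expansion instead of an external command; the builtin version avoids spawning a subshell and a `cut` process for every line:

```shell
line="123 09"

# external command: forks a subshell and a cut process per call
key=$(echo "$line" | cut -d' ' -f1)
echo "$key"

# pure bash: strip the longest suffix matching ' *', no fork at all
key=${line%% *}
echo "$key"
```

Both print `123`; inside a loop over thousands of lines, the difference in speed is substantial.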
Upvotes: -1
Reputation: 11216
One way with awk
awk '{print >("file"(!a[$1]++?1:2))}' file
or
awk '{print >("file"(a[$1]++?2:1))}' file
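The two one-liners are equivalent: `a[$1]++` evaluates to 0 the first time a key is seen and to a positive count afterwards, so the ternary selects file 1 for first occurrences and file 2 for repeats (negated in the first form, direct in the second). A quick sanity check on a small input (the names `file1` and `file2` are produced by the string concatenation in the command itself):

```shell
# build a tiny sample input
printf '123 09\n123 10\n355 07\n' > file

# route first occurrences to file1, repeats to file2
awk '{print >("file"(!a[$1]++?1:2))}' file

cat file1   # 123 09 and 355 07
cat file2   # 123 10
```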
Upvotes: 3
Reputation: 8164
Try:
awk '{if($1 in a){print > "Output2"}else{print > "Output1"}; a[$1]=1}' input
You get in the Output1 file:
123 09
355 07
765 01
You get in the Output2 file:
123 10
765 03
765 05
If you want to get only output 2, then remove the ! from your code:
awk 'a[$1]++' input
Upvotes: 0
Reputation: 785286
For both the 1st and the 2nd outputs you can use this awk command:
awk '!seen[$1]++{print > "output1"; next} {print > "output2"}' file
cat output1
123 09
355 07
765 01
cat output2
123 10
765 03
765 05
Upvotes: 0
Reputation: 85482
With Python:
seen = set()
with open('data.txt') as fin, open('f1.txt', 'w') as fout1, open('f2.txt', 'w') as fout2:
    for line in fin:
        col = line.split()[0]
        if col in seen:
            fout2.write(line)
        else:
            seen.add(col)
            fout1.write(line)
Upvotes: 0