Reputation: 1191
I'm learning Python and created this program, but it won't work and I'm hoping someone can find the error!
I have a file that has entries like this:
0 Kurthia sibirica Planococcaceae
1593 Lactobacillus hordei Lactobacillaceae
1121 Lactobacillus coleohominis Lactobacillaceae
614 Lactobacillus coryniformis Lactobacillaceae
57 Lactobacillus kitasatonis Lactobacillaceae
3909 Lactobacillus malefermentans Lactobacillaceae
My goal is to remove all the lines that start with a number that occurs only once in the whole file (unique numbers), and save all the lines that start with a number occurring twice or more to a new file. This is my attempt. It doesn't work yet (when I enable the print line, one line from the file is repeated 3 times and that's it):
#!/usr/bin/env python
infilename = 'v35.clusternum.species.txt'
outfilename = 'v13clusters.no.singletons.txt'
#remove extra letters and spaces
x = 0
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        clu, gen, spec, fam = line.split()
        for clu in line:
            if clu.count > 1:
                #print line
                outfile.write(line)
            else:
                x += 1
print("Number of Singletons:")
print(x)
Thanks for any help!
Upvotes: 0
Views: 115
Reputation: 25974
Okay, your code is kind of headed in the right direction, but you have a few things decidedly confused.
You need to separate what your script is doing into two logical steps: one, aggregating (counting) all of the clu fields; two, writing each line whose clu count is > 1. You tried to do both steps at the same time and, well, it didn't work. You can technically do it that way, but you have the syntax wrong. It's also terribly inefficient to repeatedly search through your file. Best to read it only once or twice.
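To make the "syntax wrong" point concrete, here is a small sketch, assuming Python 3 and using one sample line from the question. It shows what the inner loop and clu.count actually operate on; the variable names are only for illustration.

```python
line = "1593 Lactobacillus hordei Lactobacillaceae\n"

# Iterating over a string yields single characters, not fields,
# and the loop variable overwrites the clu unpacked from split().
chars = [clu for clu in line]
first_chars = chars[:4]          # ['1', '5', '9', '3']

# str.count is a method; without parentheses it is never called.
is_method = callable("1".count)  # True

# Calling it counts occurrences of a substring within that one string,
# not occurrences of a cluster number across the whole file.
nines_in_line = line.count("9")  # 1
```

In Python 3, comparing the uncalled method to 1 (clu.count > 1) raises a TypeError. Python 2 appears to allow the comparison without an error, which may be why the original script ran at all.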
So, let's separate the steps. First, count up your clu fields. The collections module has a Counter that you can use.
from collections import Counter
with open(infilename, 'r') as infile:
    c = Counter(line.split()[0] for line in infile)
c is now a Counter that you can use to look up the count of a given clu.
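As a quick illustration of the lookup, here is a sketch on a few in-memory lines. The sample list is hypothetical (adapted from the question's data so that one cluster number repeats); note that the keys are strings, because split() returns strings.

```python
from collections import Counter

# Hypothetical sample lines; cluster 1593 appears twice, cluster 0 once.
sample = [
    "0 Kurthia sibirica Planococcaceae\n",
    "1593 Lactobacillus hordei Lactobacillaceae\n",
    "1593 Lactobacillus coleohominis Lactobacillaceae\n",
]
c = Counter(line.split()[0] for line in sample)

twice = c["1593"]    # 2
once = c["0"]        # 1
missing = c["9999"]  # 0: a Counter returns 0 for keys it has never seen
```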
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        clu, gen, spec, fam = line.split()
        if c[clu] > 1:
            outfile.write(line)
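Putting the two passes together, one possible sketch is below. The function name filter_singletons is hypothetical, and skipping blank lines is an assumption about the input rather than something stated in the question. It also returns the singleton count that the original script tried to print.

```python
from collections import Counter

def filter_singletons(infilename, outfilename):
    """Copy lines whose first field occurs more than once.

    Returns a tuple (lines kept, singleton lines dropped).
    """
    # Pass 1: count how often each cluster number (first field) appears.
    with open(infilename, 'r') as infile:
        counts = Counter(line.split()[0] for line in infile if line.strip())

    kept = singletons = 0
    # Pass 2: write only the lines whose cluster number is repeated.
    with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines (an assumption about the input)
            if counts[line.split()[0]] > 1:
                outfile.write(line)
                kept += 1
            else:
                singletons += 1
    return kept, singletons
```

With the filenames from the question, it could be called as kept, x = filter_singletons('v35.clusternum.species.txt', 'v13clusters.no.singletons.txt'), followed by print("Number of Singletons:", x). Taking only split()[0] instead of unpacking four names means a line with more or fewer than four fields will not raise a ValueError.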
Upvotes: 2