dgr379

Reputation: 345

Counting specific characters in a file (Python)

I'd like to count specific things in a file, e.g. how many times "--undefined--" appears. Here is a piece of the file's content:

"jo:ns  76.434
pRE     75.417
zi:     75.178
dEnt    --undefined--
ba      --undefined--

I tried something like this, but it doesn't work:

with open("v3.txt", 'r') as infile:
    data = infile.readlines().decode("UTF-8")

    count = 0
    for i in data:
        if i.endswith("--undefined--"):
            count += 1
    print count

Do I have to implement, say, a dictionary of tuples to tackle this, or is there an easier solution?

EDIT:

The word in question appears only once in a line.

Upvotes: 1

Views: 5276

Answers (5)

accdias

Reputation: 5372

Quoting Raymond Hettinger, "There must be a better way":

from collections import Counter

counter = Counter()
words = ('--undefined--', 'otherword', 'onemore')

with open("v3.txt", 'r') as f:
    lines = f.readlines()
    for line in lines:
        for word in words:
            if word in line:
                counter.update((word,))  # note the single element tuple

print counter

Upvotes: 1

bruno desthuilliers

Reputation: 77902

When reading a file line by line, each line ends with the newline character:

>>> with open("blookcore/models.py") as f:
...    lines = f.readlines()
... 
>>> lines[0]
'# -*- coding: utf-8 -*-\n'
>>> 

so your endswith() test just can't work - you have to strip the line first:

if i.strip().endswith("--undefined--"):
    count += 1

Now, reading a whole file into memory is more often than not a bad idea: even if the file fits in memory, it still eats resources for no good reason. Python's file objects are iterable, so you can just loop over your file. Finally, you can specify which encoding should be used when opening the file (instead of decoding manually), using the codecs module (Python 2) or open() directly (Python 3):

# py3
with open("your/file.text", encoding="utf-8") as f:
    pass  # work with f here

# py2:
import codecs
with codecs.open("your/file.text", encoding="utf-8") as f:
    pass  # work with f here

then just use the builtin sum and a generator expression:

result = sum(line.strip().endswith("whatever") for line in f)

this relies on the fact that booleans are integers with values 0 (False) and 1 (True).
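Putting the pieces above together, a minimal Python 3 sketch might look like this. The function name count_lines_ending_with and the sample file written to a temporary directory are assumptions made purely for illustration; they are not part of the original question.

```python
import os
import tempfile


def count_lines_ending_with(path, suffix, encoding="utf-8"):
    """Count lines that end with `suffix` once surrounding whitespace is stripped."""
    with open(path, encoding=encoding) as f:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never held in memory at once.
        # Each True from endswith() contributes 1 to the sum.
        return sum(line.strip().endswith(suffix) for line in f)


# Hypothetical sample data modelled on the excerpt in the question.
sample = (
    "jo:ns  76.434\n"
    "pRE     75.417\n"
    "dEnt    --undefined--\n"
    "ba      --undefined--\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "v3.txt")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write(sample)
    result = count_lines_ending_with(sample_path, "--undefined--")

print(result)
```

Because the line is stripped before the test, this also copes with a final line that has no trailing newline.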

Upvotes: 1

Espoir Murhabazi

Reputation: 6376

You can read all the data into one string, split the string into a list, and count the occurrences of the word in that list.

with open('afile.txt', 'r') as myfile:
    data=myfile.read().replace('\n', ' ')

data.split(' ').count("--undefined--")

Or count directly on the string:

data.count("--undefined--")
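The two approaches above are not always equivalent: counting in the split list only matches whole whitespace-separated tokens, while str.count matches the substring anywhere. A small sketch of the difference, using made-up sample text (the third line is a hypothetical case, not taken from the question's file):

```python
# Made-up sample text; the last line embeds the marker inside a longer token.
text = "dEnt --undefined--\nba --undefined--\nnote x--undefined--y\n"

# Token-based: split() with no argument splits on any run of whitespace
# (spaces and newlines alike), then list.count() looks for exact matches.
token_count = text.split().count("--undefined--")

# Substring-based: str.count() counts non-overlapping occurrences anywhere,
# including inside the longer token "x--undefined--y".
substring_count = text.count("--undefined--")

print(token_count, substring_count)
```

For the file in the question, where the marker only ever appears as its own field, both give the same number.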

Upvotes: 3

aylr

Reputation: 369

Or don't limit yourself to .endswith(); use the in operator instead.

data = ''
count = 0

with open('v3.txt', 'r') as infile:
    data = infile.readlines()
print(data)

for line in data:
    if '--undefined--' in line:
        count += 1

count

Upvotes: 1

Honza Zíka

Reputation: 483

readlines() returns the list of lines, but they are not stripped (i.e. they still contain the newline character). Either strip them first:

data = [line.strip() for line in data]

or check for --undefined--\n:

if line.endswith("--undefined--\n"):

Alternatively, consider string's .count() method:

file_contents.count("--undefined--")
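One caveat when choosing between these options: the last line of a file does not necessarily end with a newline, in which case the "--undefined--\n" check misses it. A short sketch with a hand-written list standing in for the result of readlines() (the data is an assumption for illustration):

```python
# Stand-in for readlines() output; note the final line has no trailing "\n".
lines = ["dEnt    --undefined--\n", "ba      --undefined--"]

# Checking for the marker plus newline misses the unterminated last line.
with_newline = sum(line.endswith("--undefined--\n") for line in lines)

# Stripping first works whether or not the newline is present.
stripped = sum(line.strip().endswith("--undefined--") for line in lines)

# Counting on the whole contents does not depend on line endings at all.
total = "".join(lines).count("--undefined--")

print(with_newline, stripped, total)
```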

Upvotes: 1
