rpmartz

Reputation: 3809

Odd Python dictionary and string behavior

I'm working on an assignment for a course that involves computing bigram (letter pair) frequencies. We're to implement it ourselves and not use any of the many libraries that already provide robust implementations.

The assignment is straightforward, but in constructing my model, I'm seeing some very odd behavior when trying to iterate over the keys and I've got a Python question. I'm simply splitting the text into a list of characters, and then storing each bigram in a dict with its frequency. So the dict is something like { 'aa': 7, 'ab' : 9, ... }. Easy enough, I thought.

Trying to iterate over the dict to explore the data, I'm using a simple for loop like:

for k in frequencies:
    print 'bigram: %s frequency: %s' % (k, frequencies[k])

This works fine for most of the bigrams but sprinkled throughout the output there are lines with very odd output like this:

bigram: Ab frequency: 14
bigram: e; frequency: 29
frequency: 4
bigram: l? frequency: 4
bigram: -p frequency: 1
A frequency: 36

As you can see, there are a number of lines where the entire formatted string is not being printed.

I tried debugging this by printing out each letter of the bigram as I was constructing them, like so:

print 'letter one:  |' + first_letter + '| letter two: `' + second_letter + '`'

This results in the same odd output for a few of the lines, where the first part of my output string is ignored:

letter one:  |t| letter two: `.`
`
| letter two: `T`
letter one:  |T| letter two: `h`

Doing this, I noticed that '.' characters seem to be causing these problems in some, but not all, of the cases, so I modified the bigram parser to skip bigrams containing non-alphanumeric characters, but got the same issues. It would seem that some_dict['.T'] should be fine: the key is hashable, etc.

My question: why is the output (seemingly) being mangled? What could be causing these format strings to have their first parts ignored?

Using Python 2.7.5, if that matters. Output is identical on Mac OS X and Ubuntu 12.04.

Upvotes: 0

Answers (1)

Martijn Pieters

Reputation: 1121196

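You have control characters in your bigrams that either clear the line or return the print position to the start of the line (usually ASCII codepoint 0x0D, \r, CARRIAGE RETURN). Whatever is printed after such a character then overwrites the start of the line, which is why the first part of your format string appears to be missing.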

Use %r to print the string's literal representation instead; control characters are then shown as their Python string escape codes:

for k in frequencies:
    print 'bigram: %r frequency: %s' % (k, frequencies[k])
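
To make the effect concrete, here is a minimal sketch. The sample dict and the helper names describe and keys_with_control_chars are made up for illustration, not taken from your code. It formats keys with %r and lists which keys contain an ASCII control character; print is called with a single argument so the snippet should run on both Python 2.7 and Python 3.

```python
# Hypothetical sample data: two of the keys contain a carriage return.
frequencies = {'Ab': 14, '.\r': 4, '\rA': 36}


def describe(key, count):
    # %r formats the key with repr(), so a carriage return is shown as
    # the two visible characters backslash + r instead of moving the cursor.
    return 'bigram: %r frequency: %s' % (key, count)


def keys_with_control_chars(freqs):
    # ASCII control characters are code points below 0x20, plus DEL (0x7F).
    return sorted(k for k in freqs
                  if any(ord(c) < 0x20 or ord(c) == 0x7f for c in k))


for k in sorted(frequencies):
    print(describe(k, frequencies[k]))

# Shows which keys would garble the output when printed with %s.
print(keys_with_control_chars(frequencies))
```

If the second print shows keys you did not expect, the input text most likely has \r\n line endings, or the filter for non-alphanumeric characters is not being applied to every character of the bigram.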

As a side note, you may want to look at collections.Counter() for collecting bigram frequencies; it is a subclass of dict that adds several niceties such as counting frequencies for you and a method for listing the most common elements (in sorted order).
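
A rough sketch of what that could look like, assuming the whole text is already in memory as one string; the function name bigram_counts and the sample input are only placeholders:

```python
from collections import Counter


def bigram_counts(text):
    # Pair each character with its successor: 'abc' -> 'ab', 'bc'.
    return Counter(a + b for a, b in zip(text, text[1:]))


sample = 'banana'  # hypothetical input
counts = bigram_counts(sample)

# most_common(n) returns (bigram, count) pairs, highest count first.
for bigram, n in counts.most_common(2):
    print('bigram: %r frequency: %s' % (bigram, n))
```

A Counter returns 0 for bigrams it has never seen, so there is no need to check for the key before incrementing or looking up a count.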

Upvotes: 3
