Unexpected whitespace in Python-generated strings

Question

I am using Python to generate an ASCII file composed of very long lines. This is one example line (let's say line 100 in the file, '[...]' are added by me to shorten the line):

{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}

If I open the ASCII file that I generated with ipython:

f = open('myfile','r')
print repr(f.readlines()[99])

I do obtain the expected line printed correctly ('[...]' are added by me to shorten the line):

'{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}\n'

However, when I open this file with the program that is supposed to read it, the program raises an exception, complaining about an unexpected pair after 478 1. So I tried opening the file with vim. vim shows no problem either, but if I copy the line as displayed by vim and paste it into another text editor (TextMate, in my case), this is the line that I obtain ('[...]' are added by me to shorten the line):

{6 1,14 1,[...],264 1,270      2,274 2,[...],478 1,4     79 8,485 1,[...]}

This line indeed has a problem after the pair 478 1. I tried generating my lines in different ways (string concatenation, cStringIO, ...), but I always get this result. With cStringIO, for example, the lines are generated as follows (I tried changing this as well, with no luck):

def _construct_arff(self,attributes,header,data_rows):
  """Create the string representation of a Weka ARFF file.
     *attributes* is a dictionary with attribute_name:attribute_type
       (e.g., 'num_of_days':'NUMERIC')
     *header* is a list of the attributes sorted
       (e.g., ['age','name','num_of_days'])
     *data_rows* is a list of lists with the values, sorted as in the header
       (e.g., [ [88,'John',465],[77,'Bob',223]]"""

  arff_str = cStringIO.StringIO()
  arff_str.write('@relation %s\n' % self.relation_name)

  for idx,att_name in enumerate(header):
    try:
      name = att_name.replace("\\","\\\\").replace("'","\\'")
      arff_str.write("@attribute '%s' %s\n" % (name,attributes[att_name]))
    except UnicodeEncodeError:
      arff_str.write('@attribute unicode_err_%s %s\n'
                     % (idx,attributes[att_name]))

  arff_str.write('@data\n')
  for data_row in data_rows:
    row = []
    for att_idx,att_name in enumerate(header):
      att_type = attributes[att_name]
      value = data_row[att_idx]
      # numeric attributes can be sparse: None and zeros are not written
      if ((not att_type == constants.ARRF_NUMERIC)
          or not ((value == None) or value == 0)):
        row.append('%s %s' % (att_idx,value))
    arff_str.write('{' + (','.join(row)) + '}\n')
  return arff_str.getvalue()

UPDATE: As you can see from the code above, the function transforms a given set of data to a special arff file format. I noticed that one of the attributes I was creating contained numbers as strings (e.g., '1', instead of 1). By forcing these numbers into integers:

features[name] = int(value)

With this change I recreated the ARFF file successfully. However, I don't see how this, which is a value, can have an impact on the formatting of *att_idx*, which is always an integer, as also pointed out by @JohnMachin and @gnibbler (thanks for your answers, by the way). So, even though my code runs now, I still don't see why this happens. How can the value, if not properly converted to int, influence the formatting of something else?

This file contains the wrongly formatted version.

John Machin · Accepted Answer

The built-in function repr is your friend. It will show you unambiguously what you have in your file.

Do this:

f = open('myfile','r')
print repr(f.readlines()[99])

and edit your question to show the result.
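
As a rough illustration of why repr helps here, the following sketch lists every character in a line that is not printable ASCII, together with its offset. The helper name find_suspicious_chars and the sample line with an embedded tab are made up for this example; they are not taken from the asker's file.

```python
def find_suspicious_chars(line):
    """Return (offset, repr(char)) for each character that is not printable ASCII."""
    hits = []
    # Ignore the trailing newline, which is expected at the end of every line.
    for offset, ch in enumerate(line.rstrip("\n")):
        if not (32 <= ord(ch) <= 126):
            hits.append((offset, repr(ch)))
    return hits


# Hypothetical line: a tab hides between "4" and "79".
sample = "{6 1,478 1,4\t79 8}\n"

print(repr(sample))                   # the tab shows up as \t
print(find_suspicious_chars(sample))  # offset and repr of the offending character
```

An editor may render such a character as ordinary spacing, whereas repr spells it out as an escape sequence.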

Update: As to how it got there, it is impossible to tell, because it cannot have been generated by the code that you showed. The value 37 should be a value of att_idx, which comes from enumerate() and so must be an int. You are formatting this int with %s ... 37 can't become 3rubbish7. Also, that loop should generate att_idx in order (0, 1, and so on), but you are missing many values and there is nothing conditional inside your loop.

Please show us the code that you actually ran.

Update:

And again, this code won't run:

for idx,att_name in enumerate(header):
    arff_str.write("@attribute '%s' %s\n" % (name,attributes[att_name]))

because name is not defined; you probably mean att_name.

Perhaps we can short-circuit all this stuffing about: post a copy of your output file (zipped if it's huge) on the web somewhere so that we can see for ourselves what might be disturbing its consumers. Please do edit your question to say which line(s) exhibit(s) the problem.

By the way, you say some of the data is string rather than integer, and the problem goes away if you coerce the data to int by doing features[name] = int(value) ... what is 'features'?? What is 'name'??

Are any of those strings unicode instead of str?

Update 2 (after bad file posted on net)

No info was supplied on which line(s) exhibit(s) the problem. As it turned out, no lines exhibited the described problem with attribute 479. I wrote this checking script:

import re, sys
# sample data line:
# {40 1,101 3,319 2,375 2,525 2,530 bug}
# Looks like all data lines end in ",530 bug}" or ",530 other}"
pattern1 = r"\{(?:\d+ \d+,)*\d+ \w+\}$"
matcher1 = re.compile(pattern1).match
pattern2 = r"\{(?:\d+ \d+,)*"
matcher2 = re.compile(pattern2).match
bad_atts = re.compile(r"\D\d+\s+\W").findall
got_data = False
for lino, line in enumerate(open(sys.argv[1], "r"), 1):
    if not got_data:
        got_data = line.startswith('@data')
        continue
    if not matcher1(line):
        print
        print lino, repr(line)
        m = matcher2(line)
        if m:
            print "OK up to offset", m.end()
            print bad_atts(line)

Sample output (wrapped at column 80):

581 '{2 1,7 1,9 1,12 1,13 1,14 1,15 1,16 1,17 1,18 1,21 1,22 1,24 1,25 1,26 1,27
 1,29 1,32 1,33 1,36 1,39 1,40 1,44 1,48 1,49 1,50 1,54 1,57 1,58 1,60 1,67 1,68
 1,69 1,71 1,74 1,75 1,76 1,77 1,80 1,88 1,93 1,101 ,103 6,104 2,109 20,110 3,11
2 2,114 1,119 17,120 4,124 39,128 5,137 1,138 1,139 1,162 1,168 1,172 18,175 1,1
76 6,179 1,180 1,181 2,185 2,187 9,188 8,190 1,193 1,195 2,196 4,197 1,199 3,201
 3,202 4,203 5,206 1,207 2,208 1,210 2,211 1,212 5,213 1,215 2,216 3,218 2,220 2
,221 3,225 8,226 1,233 1,241 4,242 1,248 5,254 2,255 1,257 4,258 4,260 1,266 1,2
68 1,269 3,270 2,271 5,273 1,276 1,277 1,280 1,282 1,283 11,285 1,288 1,289 1,29
6 8,298 1,299 1,303 1,304 11,306 5,308 1,309 8,310 1,315 3,316 1,319 11,320 5,32
1 11,322 2,329 1,342 2,345 1,349 1,353 2,355 2,358 3,359 1,362 1,367 2,368 1,369
 1,373 2,375 9,377 1,381 4,382 1,383 3,387 1,388 5,395 2,397 2,400 1,401 7,407 2
,412 1,416 1,419 2,421 2,422 1,425 2,427 1,431 1,433 7,434 1,435 1,436 2,440 1,4
49 1,454 2,455 1,460 3,461 1,463 1,467 1,470 1,471 2,472 7,477 2,478 11,479 31,4
82 6,485 7,487 1,490 2,492 16,494 2,495 1,497 1,499 1,501 1,502 1,503 1,504 11,5
06 3,510 2,515 1,516 2,517 3,518 1,522 4,523 2,524 1,525 4,527 2,528 7,529 3,530
 bug}\n'
OK up to offset 203
[',101 ,']

709 '{101 ,124 2,184 1,188 1,333 1,492 3,500 4,530 bug}\n'
OK up to offset 1
['{101 ,']
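
To make the two reports above easier to reproduce, here is a small sketch that applies the same three regular expressions to the sample data line from the script's comment and to line 709 as quoted above. The wrapper function diagnose is a name introduced only for this illustration.

```python
import re

# Same patterns as in the checking script.
matcher1 = re.compile(r"\{(?:\d+ \d+,)*\d+ \w+\}$").match  # a fully valid data line
matcher2 = re.compile(r"\{(?:\d+ \d+,)*").match             # the longest valid prefix
bad_atts = re.compile(r"\D\d+\s+\W").findall                # index followed by a missing value


def diagnose(line):
    """Return None for a valid line, else (end of valid prefix, suspicious fragments)."""
    if matcher1(line):
        return None
    m = matcher2(line)
    return (m.end() if m else None, bad_atts(line))


good = "{40 1,101 3,319 2,375 2,525 2,530 bug}\n"
bad = "{101 ,124 2,184 1,188 1,333 1,492 3,500 4,530 bug}\n"

print(diagnose(good))  # None
print(diagnose(bad))   # (1, ['{101 ,'])
```

The second result corresponds to "OK up to offset 1" in the output for line 709: only the opening brace is valid before the pair with the missing value.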

So it looks like the attribute with att_idx == 101 can sometimes contain the empty string ''. You need to sort out how this attribute is to be treated. It would help your thinking if you unwound this Byzantine code:

  if ((not att_type == constants.ARRF_NUMERIC)
      or not ((value == None) or value == 0)):

Aside: that "expletive deleted" code won't run; it should be ARFF, not ARRF

into:

if value or att_type != constants.ARFF_NUMERIC:

or maybe just if value: which will filter out all of None, 0, and "". Note that att_idx == 101 corresponds to the attribute "priority" which is given a STRING type in the ARFF file header:

[line 103] @attribute 'priority' STRING
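
The candidate conditions can be compared side by side with a short sketch. ARFF_NUMERIC below is a stand-in for constants.ARFF_NUMERIC, whose real value is not shown in the question, and the three function names are invented for the comparison.

```python
ARFF_NUMERIC = "NUMERIC"  # assumed stand-in for constants.ARFF_NUMERIC


def original_keeps(value, att_type):
    # The condition from the question, unchanged apart from the ARFF spelling.
    return (not att_type == ARFF_NUMERIC) or not ((value == None) or value == 0)


def simplified_keeps(value, att_type):
    # The first suggested rewrite.
    return bool(value or att_type != ARFF_NUMERIC)


def truthy_keeps(value, att_type):
    # The shortest suggestion: "if value:".
    return bool(value)


for att_type in ("NUMERIC", "STRING"):
    for value in (None, 0, "", 1, "1"):
        print(att_type, repr(value),
              original_keeps(value, att_type),
              simplified_keeps(value, att_type),
              truthy_keeps(value, att_type))
```

In this sketch an empty string in a STRING attribute is kept by both the original condition and the first rewrite, so a pair such as "101 " would still be written; only the plain "if value:" test drops it. For a NUMERIC attribute, the original condition keeps "" while both rewrites drop it.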

By the way, your statement about features[name] = int(value) "fixing" the problem is very suspicious; int("") raises an exception.
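
A minimal sketch of that behaviour follows. The helper to_int_or_none is hypothetical and only illustrates one way an empty value could be handled before it reaches the writer; it is not what the asker's code does.

```python
def to_int_or_none(value):
    """Convert value to int, or return None when it cannot be converted."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # int("") raises ValueError; int(None) raises TypeError.
        return None


print(to_int_or_none("1"))   # 1
print(to_int_or_none(""))    # None
```

A None result would then be skipped by the sparse-writing condition for numeric attributes.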

It may help you to read the warning at the end of this wiki section about sparse ARFF files.
