Reputation: 1838

Python not splitting CRLF correctly

I'm writing a script in Python to convert very simple function documentation to XML. The format I'm using would convert:

date_time_of(date) Returns the time part of the indicated date-time value, setting the date part to 0.

to:

<item name="date_time_of">

<arg>(date)</arg>

<help> Returns the time part of the indicated date-time value, setting the date part to 0.</help>

</item>

So far it works great (the XML I posted above was generated by the program), but it should handle several lines of documentation pasted at once, and it only works for the first line pasted into the application. I checked the pasted documentation in Notepad++ and the lines did indeed have CRLF at the end, so what is my problem? Here is my code:

mainText = input("Enter your text to convert:\r\n")

try:
    for line in mainText.split('\r\n'):
        name = line.split("(")[0]
        arg = line.split("(")[1]
        arg = arg.split(")")[0]
        hlp = line.split(")",1)[1]
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
    print("Error!")

Any idea of what the issue is here? Thanks.

Upvotes: 0

Answers (3)

eyquem

Reputation: 27585

Patrick Moriarty,

It seems to me that you didn't specifically mention the console, and that your main concern is to pass several lines at once to be processed. I could reproduce your problem in only one way: running the program in IDLE, manually copying several lines from a file, and pasting them into raw_input().

Trying to understand your problem led me to the following facts:

When data is copied from a file and pasted into raw_input(), the \r\n newlines are transformed into \n, so the string returned by raw_input() no longer contains \r\n. Hence split('\r\n') finds nothing to split on in this string.
When data containing isolated \r and \n characters is pasted into a Notepad++ window with the display of special characters enabled, CR LF symbols appear at the end of every line, even where there was only a lone \r or \n. Hence using Notepad++ to check the nature of the newlines leads to an erroneous conclusion.
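
The first fact can be checked without pasting anything. Here is a minimal sketch in Python 3, assuming the text reaches the program with bare \n line endings; the sample string is made up for illustration:

```python
# Hypothetical sample: what the program sees once CRLF has become LF.
pasted = "first(a) one\nsecond(b) two\nthird(c) three"

# '\r\n' never occurs in the text, so split() returns the whole text as one item.
crlf_parts = pasted.split('\r\n')
print(len(crlf_parts))  # 1

# Splitting on '\n' separates the three lines.
lf_parts = pasted.split('\n')
print(len(lf_parts))    # 3
```

With a single item in the list, the loop body runs once over the entire text, which matches the symptom of only the first entry being converted.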

The first fact is the cause of your problem. I don't know the underlying reason for this transformation of data copied from a file and pasted into raw_input(), which is why I posted a question on Stack Overflow:

Strange vanishing of CR in strings coming from a copy of a file's content passed to raw_input()

The second fact is responsible for your confusion and despair. Bad luck...

So, what can be done to solve your problem?

Here is code that reproduces the problem. Note the modified algorithm in it, which replaces the repeated splits you applied to each line.

ch = "date_time_of(date) Returns the time part.\r\n"+\
     "divmod(a, b) Returns quotient and remainder.\r\n"+\
     "enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
     "A\rB\nC"

with open('funcdoc.txt','wb') as f:
    f.write(ch)

print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)

print "open 'funcdoc.txt' to manually copy its content, and paste it on the following line"
mainText = raw_input("Enter your text to convert:\n")
print "OK, copy-paste of file 'funcdoc.txt' ' s content has been performed"


print "\nrepr(mainText)==",repr(mainText)

try:
    for line in mainText.split('\r\n'):  
        name,_,arghelp  = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")

Here's the solution mentioned by delnan: «read from the source instead of having a human copy and paste it.» It works with your split('\r\n'):

ch = "date_time_of(date) Returns the time part.\r\n"+\
     "divmod(a, b) Returns quotient and remainder.\r\n"+\
     "enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
     "A\rB\nC"

with open('funcdoc.txt','wb') as f:
    f.write(ch)

print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)

#####################################

with open('funcdoc.txt','rb') as f:
    mainText = f.read()

print "\nfile 'funcdoc.txt' has just been opened and its content copied and put to mainText"

print "\nrepr(mainText)==",repr(mainText)
print

try:
    for line in mainText.split('\r\n'):  
        name,_,arghelp  = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")

And finally, here is Python's own answer for processing the altered, human-copied text: the splitlines() method, which treats every kind of newline (\r, \n, or \r\n) as a separator. So replace

for line in mainText.split('\r\n'):

with

for line in mainText.splitlines():
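
A minimal sketch of the splitlines() approach in Python 3, combined with the partition() parsing used above. The helper name to_xml and the sample text are assumptions made for illustration, not part of the original program:

```python
def to_xml(text):
    """Convert 'name(args) help' lines to <item> blocks (illustrative helper)."""
    items = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines instead of emitting empty items
        name, _, arghelp = line.partition("(")
        arg, _, hlp = arghelp.partition(") ")
        items.append('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n'
                     % (name, arg, hlp))
    return items

# Hypothetical input that mixes all three line endings: \r\n, \n and \r.
sample = ("date_time_of(date) Returns the time part.\r\n"
          "divmod(a, b) Returns quotient and remainder.\n"
          "len(s) Returns the length.\r")

for item in to_xml(sample):
    print(item)
```

Note that for str objects splitlines() also treats a few other characters, such as form feed, as line boundaries, which is rarely an issue for pasted documentation but is worth knowing.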

Upvotes: 0

Benson

Reputation: 22847

The best way to handle reading lines from standard input (the console) is to iterate over the sys.stdin object. Rewritten to do this, your code would look something like this:

from sys import stdin

try:
    for line in stdin:
        # Iterating over stdin keeps the line ending, so strip it first
        line = line.rstrip('\r\n')
        name = line.split("(")[0]
        arg = line.split("(")[1]
        arg = arg.split(")")[0]
        hlp = line.split(")",1)[1]
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
    print("Error!")

That said, it's worth noting that your parsing code could be significantly simplified with a little help from regular expressions. Here's an example:

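import re
import sys

for line in sys.stdin:
    # Iterating over stdin keeps the line ending, so strip it first
    line = line.rstrip('\r\n')
    # Groups: name, parenthesized arguments, remaining help text
    result = re.match(r"(.*?)\((.*?)\)(.*)", line)
    if result:
        name = result.group(1)
        arg = result.group(2)  # kept as one string so it formats as "(a, b)", not as a list
        hlp = result.group(3)
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
    else:
        print("There was an error parsing this line: '%s'" % line)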

I hope this helps you simplify your code.

Upvotes: 0

Mark Tolonen

Reputation: 177961

input() only reads one line.

Try this. Enter a blank line to stop collecting lines.

lines = []
while True:
    line = input('line: ')
    if line:
        lines.append(line)
    else:
        break
print(lines)
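
The same loop can be wrapped in a function that also stops cleanly at end of input. This is only a sketch: the helper name collect_lines and its read_line parameter are made up here so the loop can be tried without typing at a prompt.

```python
def collect_lines(read_line=input):
    """Collect lines until a blank line or end of input (hypothetical helper)."""
    lines = []
    while True:
        try:
            line = read_line()
        except EOFError:
            break  # input ended without a blank line (e.g. a closed pipe)
        if not line:
            break  # a blank line ends the collection
        lines.append(line)
    return lines

# Simulated user input: two entries, a blank line, then text that is never read.
fake_input = iter(["a(x) one", "b(y) two", "", "ignored"])
lines = collect_lines(lambda: next(fake_input))
print(lines)  # ['a(x) one', 'b(y) two']
```

Called with no argument, collect_lines() reads from the console through input(), exactly as the loop above does.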

Upvotes: 4
