Jonatan

Reputation: 61

Python, working with strings

I need to write a program for my class that reads messy text from a file and gives it a book-like layout. It should turn this input:

This    is programing   story , for programmers  . One day    a variable
called
v  comes    to a   bar    and ordred   some whiskey,   when suddenly 
      a      new variable was declared .
a new variable asked : "    What did you ordered? "

into this output:

This is programing story,
for programmers. One day 
a variable called v comes
to a bar and ordred some 
whiskey, when suddenly a 
new variable was 
declared. A new variable
asked: "what did you 
ordered?"

I am a total beginner at programming. Here is my code:

   def vypis(t):
    cely_text = ''
    for riadok in t:
        cely_text += riadok.strip()
    a = 0     
    for i in range(0,80):
        if cely_text[0+a] == " " and cely_text[a+1] == " ":
            cely_text = cely_text.replace ("  ", " ")
        a+=1
    d=0    
    for c in range(0,80):
        if cely_text[0+d] == " " and (cely_text[a+1] == "," or cely_text[a+1] == "." or cely_text[a+1] == "!" or cely_text[a+1] == "?"):
            cely_text = cely_text.replace (" ", "")
        d+=1   
def vymen(riadok):
    for ch in riadok:
        if ch in '.,":':
            riadok = riadok[ch-1].replace(" ", "")
x = int(input("Zadaj x"))
t = open("text.txt", "r")
v = open("prazdny.txt", "w")
print(vypis(t))  

This code deleted some of the spaces. I also tried to delete the spaces before signs like " .,_?", but that did not work. Why? Thanks for your help :)

Upvotes: 0

Views: 609

Answers (2)

val

Reputation: 8689

You want to do quite a lot of things, so let's take them in order:

Let's get the text in a nice text form (a list of strings):

>>> with open('text.txt', 'r') as f:
...     lines = f.readlines()

>>> lines
['This    is programing   story , for programmers  . One day    a variable', 
 'called', 'v  comes    to a   bar    and ordred   some whiskey,   when suddenly ',
 '      a      new variable was declared .', 
 'a new variable asked : "    What did you ordered? "']

You have newlines all around the place. Let's replace them by spaces and join everything into a single big string:

>>> text = ' '.join(line.replace('\n', ' ') for line in lines)

>>> text
'This    is programing   story , for programmers  . One day    a variable called v  comes    to a   bar    and ordred   some whiskey,   when suddenly        a      new variable was declared . a new variable asked : "    What did you ordered? "'

Now we want to remove the repeated spaces. We split on any run of whitespace (spaces, tabs, newlines) and keep only the non-empty words:

>>> words = [word for word in text.split() if word]
>>> words
['This', 'is', 'programing', 'story', ',', 'for', 'programmers', '.', 'One', 'day', 'a', 'variable', 'called', 'v', 'comes', 'to', 'a', 'bar', 'and', 'ordred', 'some', 'whiskey,', 'when', 'suddenly', 'a', 'new', 'variable', 'was', 'declared', '.', 'a', 'new', 'variable', 'asked', ':', '"', 'What', 'did', 'you', 'ordered?', '"']

Let us join our words with spaces (only one between each pair this time):

>>> text = ' '.join(words)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'

We now want to remove the space in every <SPACE>., <SPACE>, and similar pair, so that each punctuation sign sticks to the word before it:

>>> for char in (',', '.', ':', '"', '?', '!'):
...     text = text.replace(' ' + char, char)
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. a new variable asked:" What did you ordered?"'

OK, the work is not done yet: the quotation marks (") are still messed up, the capital letters are not set, etc. You can keep updating your text incrementally. For the capital letters, consider for instance:

>>> sentences = text.split('.')
>>> sentences
['This is programing story, for programmers', ' One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared', ' a new variable asked:" What did you ordered?"']

See how you can fix it? The trick is to use only string transformations such that:

  1. A correct sentence is UNCHANGED by the transformation
  2. An incorrect sentence is IMPROVED by the transformation

This way you can compose them and improve your text incrementally.
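As a rough sketch of two such transformations (the function names are made up for this illustration, and the rules are deliberately simplistic), each one leaves an already correct text unchanged and improves an incorrect one, so they can be chained in a loop:

```python
import re

def space_after_colon(text):
    # Ensure exactly one space after every colon (hypothetical rule,
    # it would also touch things like "10:30").
    return re.sub(r': *', ': ', text)

def capitalize_sentences(text):
    # Upper-case a lower-case letter at the start of the text or
    # right after '.', '?' or '!' followed by a space.
    return re.sub(r'(^|[.?!] )([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)

def tidy(text):
    # Apply every transformation in turn
    for transform in (space_after_colon, capitalize_sentences):
        text = transform(text)
    return text
```

Running tidy a second time on its own output changes nothing, which is exactly the property described above.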

Once you have a nicely formatted text, like this:

>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. A new variable asked: "what did you ordered?"'

You have to define similar syntactic rules for printing it out in book format. Consider for instance the function:

>>> def prettyprint(text):
...     return '\n'.join(text[i:i+50] for i in range(0, len(text), 50))

It will print lines of exactly 50 characters (the last one may be shorter):

>>> print(prettyprint(text))
This is programing story, for programmers. One day
 a variable called v comes to a bar and ordred som
e whiskey, when suddenly a new variable was declar
ed. A new variable asked: "what did you ordered?"

Not bad, but it can be better. Just like we previously juggled with text, lines, sentences and words to match the syntactic rules of the English language, we want to do exactly the same to match the syntactic rules of printed books.

In that case, both the English language and printed books work on the same units: words, arranged in sentences. This suggests we might want to work on these directly. A simple way to do that is to define your own objects:

>>> class Sentence(object):
...     def __init__(self, content, punctuation):
...         self.content = content
...         self.endby = punctuation
...     def pretty(self):
...         nice = []
...         content = self.content.pretty()
...         # A sentence starts with a capital letter
...         nice.append(content[0].upper())
...         # The rest has already been prettified by the content
...         nice.extend(content[1:])
...         # Do not forget the punctuation sign
...         nice.append(self.endby)
...         return ''.join(nice)

>>> class Paragraph(object):
...     def __init__(self, sentences):
...         self.sentences = sentences
...     def pretty(self):
...         # Separating our sentences by a single space
...         return ' '.join(sentence.pretty() for sentence in self.sentences)

etc... This way you can represent your text as:

>>> Paragraph([
...   Sentence(
...     Propositions([Proposition(['this', 
...                                'is', 
...                                'programming', 
...                                'story']),
...                   Proposition(['for',
...                                'programmers'])],
...                   ','),
...     '.'),
...   Sentence(...

etc...
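Building such a tree from an already cleaned-up string could look like the sketch below. To keep it short it uses plain lists and tuples instead of the classes above, and the parse function is a hypothetical helper that only knows about commas and end-of-sentence signs:

```python
import re

def parse(text):
    """Break a cleaned-up text into sentences, propositions and words.

    Returns a list of (propositions, end_sign) pairs, where
    propositions is a list of word lists.
    """
    tree = []
    # A sentence is any run of characters up to and including '.', '?' or '!'
    for content, end_sign in re.findall(r'([^.?!]+)([.?!])', text):
        # Propositions are separated by commas, words by whitespace
        propositions = [part.split() for part in content.split(',')]
        tree.append((propositions, end_sign))
    return tree
```

Each word list can then be wrapped in a Proposition, each pair in a Sentence, and so on.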

Converting from a string (even a messed up one) to such a tree is relatively straightforward as you only break down to the smallest possible elements. When you want to print it in book format, you can define your own book methods on each element of the tree, e.g. like this, passing around the current line, the output lines and the current offset on the current line:

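 class Proposition(object):
     ...
     def book(self, line, lines, offset, line_length):
         for word in self.words:
             # Start a new line if the word does not fit on the current one
             if line and offset + len(word) > line_length:
                 lines.append(' '.join(line))
                 line = []
                 offset = 0
             line.append(word)
             # Account for the word and the space that follows it
             offset += len(word) + 1
         return line, lines, offset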

 ...

 class Propositions(object):
     ...
     def book(self, line, lines, offset, line_length):
         line, lines, offset = self.Proposition1.book(line, lines, offset, line_length)
         # The punctuation sign sticks to the last word of the first proposition
         word = line.pop() + self.punctuation
         offset += len(self.punctuation)
         if line and offset - 1 > line_length:
             # The word and its punctuation sign no longer fit:
             # move both to a new line
             lines.append(' '.join(line))
             line = []
             offset = len(word) + 1
         line.append(word)
         line, lines, offset = self.Proposition2.book(line, lines, offset, line_length)
         return line, lines, offset

And work your way up to Sentence, Paragraph, Chapter...

This is a very simplistic implementation (and actually a non-trivial problem) which does not take into account syllabification or justification (which you would probably like to have), but this is the way to go.
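For comparison, the standard library's textwrap module already wraps at word boundaries, so it makes a handy baseline to check your own book methods against. A minimal sketch, where the book_lines name and the width of 25 are assumptions chosen to resemble the output in the question:

```python
import textwrap

def book_lines(text, width=25):
    # textwrap.wrap breaks only at whitespace, so words stay whole
    # (a single word longer than width would still be split).
    return textwrap.wrap(text, width=width)

text = ('This is programing story, for programmers. '
        'One day a variable called v comes to a bar.')
print('\n'.join(book_lines(text)))
```

It does no hyphenation or justification either, so those would still have to be written by hand.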

Note that I did not mention the string module, string formatting or regular expressions, which are tools to use once you can define your syntactic rules or transformations. These are extremely powerful tools, but the most important thing here is to know exactly which algorithm transforms an invalid string into a valid one. Once you have some working pseudocode, regexps and format strings can help you achieve it with less pain than plain character iteration. In my previous example of a tree of words, for instance, regexps can tremendously ease the construction of the tree, and Python's powerful string formatting functions can make writing the book or pretty methods much easier.

Upvotes: 3

David Neale

Reputation: 17018

To strip the multiple spaces you could use a simple regex substitution.

import re
cely_text = re.sub(' +',' ', cely_text)

Then for punctuation you could run a similar sub:

cely_text = re.sub(r' +([,.:])', r'\g<1>', cely_text)
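Put together, the two substitutions could be wrapped in one function. This is only a sketch: the clean name is made up, it collapses any whitespace (including newlines) rather than just spaces, and it adds ';', '?' and '!' to the set of signs, which goes slightly beyond the two lines above:

```python
import re

def clean(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text).strip()
    # Drop the spaces that precede a punctuation sign
    return re.sub(r' +([,.:;?!])', r'\1', text)
```

For example, clean('story , for programmers  . One') gives 'story, for programmers. One'. Quotation marks are left alone, since an opening quote needs a space before it and a closing one does not.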

Upvotes: 1
