Harm Josh

Reputation: 11

Counting the number of spaces between words in a file using Python?

I'm really close. I read through "number of space between each word" and it does provide this line:

counts = [(len(list(cpart))) for c,cpart in groupby(s) if c == ' ']

but I don't really understand it. I understand, or am assuming, that c is the delimiter, s is what you're grouping by, and the resulting list (array? I'm new to Python) goes into counts (s refers to a previously instantiated variable).

How would I determine something like this?

                                                  AMOUNT       DATE       
   NAME          ACCOUNT#         DISCOUNT         DUE         DUE

I am creating a program that looks at the headers of a randomly created COBOL output file and uses them to create the associated PIC X clauses.

Example solution output would be:

  1. PIC X(30) VALUE SPACES.
  2. PIC X(6) VALUE "AMOUNT".
  3. PIC X(8) VALUE SPACES.
  4. PIC X(4) VALUE "DATE".

The important parts are the numbers. I can obviously determine the lengths of the strings, but I'm not sure how to get the number of spaces.

Here is what I have so far, to show I'm working on it, lol:

from itertools import groupby
from test.test_iterlen import len
from macpath import split
from lib2to3.fixer_util import String

file = open("C:\\Users\\Joshua\\Desktop\\Practice\\cobol.cbl", 'r+')

line1 = file.readline()
split = line1.split()
print (split)
print ()

counts = [(len(list(cpart))) for c,cpart in groupby(split) if c == ' ']

print (counts)


index = 0
while index != split.__len__():
    if split[index].strip() != None:
        print ("PICX(" + ") VALUE " + "\"" + split[index] + "\".")
    elif counts[index] == None:
        print ("PICX(" + ") VALUE " + "\"" + split[index] + "\".") 
    index+=1

Upvotes: 1

Views: 2463

Answers (2)

Bill Woodger

Reputation: 13076

There's no particular point in breaking up the output like that. You could write:

     05  FILLER (optional) PIC X(width-of-report) VALUE
     "                              AMOUNT        DATE             "(in column 72)
-                         ".

The "-" is in column 7, and shows the continuation of an alphanumeric literal, which needs no opening quote, but needs a closing quote.

Your processing to create that is very simple. You always output those three lines; all you have to do is "chop" your data into 59 bytes (for the second line) and "the rest" (not knowing your report width) for the third line.

Upvotes: 0

askewchan

Reputation: 46578

I'll begin by explaining the first line:

counts = [(len(list(cpart))) for c,cpart in groupby(s) if c == ' ']

s is actually the input string. So, to run this you'd start with:

s = "   NAME          ACCOUNT#         DISCOUNT         DUE         DUE"

groupby(s) returns an iterator of tuples. The first value in each tuple is a character from the input string, and the second value is another (nested) iterator that runs through the consecutive repeats of that character. Put into list form (for illustration) it would look like this:

groupby("hello!!!")
[('h', ['h']), ('e', ['e']), ('l', ['l', 'l']), ('o', ['o']), ('!', ['!', '!', '!'])]
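
To see that list form for yourself, you have to materialize the nested iterators, since groupby itself only hands back a lazy iterator. A minimal sketch (the helper name runs is made up for illustration, it is not part of itertools):

```python
from itertools import groupby

def runs(text):
    """Return (character, run) pairs with every run turned into a list."""
    # 'runs' is a hypothetical helper name used only for this illustration.
    return [(char, list(group)) for char, group in groupby(text)]

print(runs("hello!!!"))
# [('h', ['h']), ('e', ['e']), ('l', ['l', 'l']), ('o', ['o']), ('!', ['!', '!', '!'])]
```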

So, c is not a delimiter; it's the variable that holds each character in the string s, and cpart is the iterator through all the consecutive values of c. Once you call list(cpart), it gives a list of [c,c,c,...] (each item is the same!), and the length of that list is the number of times that the character c is repeated. Normally it will just be one. For example, for the 'A' in 'NAME' you'll get c == 'A' and list(cpart) == ['A']. But for the spaces between NAME and ACCOUNT#, you'll get c == ' ' and list(cpart) == [' ',' ',' ',' ',' ',' ',' ',' ',' ',' '].

The whole thing being inside brackets [] means that it generates a list as if you were appending to a list within a for loop, and the value of each item is the expression before the for. Here, it's the len(list(cpart)) which counts the length of that list of repeated instances of a character. Thus, it'll be a list with the numbers of times a character is repeated. The if c == ' ' means that item will be added to the list only when that character is a space.
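
If the one-liner is hard to read, it may help to see it unrolled into an ordinary for loop. This is only a sketch of the same logic; the function name count_space_runs and the sample string are made up for the example:

```python
from itertools import groupby

def count_space_runs(s):
    """Loop form of: [len(list(cpart)) for c, cpart in groupby(s) if c == ' ']"""
    # 'count_space_runs' is a hypothetical name, chosen only for this example.
    counts = []
    for c, cpart in groupby(s):
        if c == ' ':                          # keep only the runs of spaces
            counts.append(len(list(cpart)))   # length of this run
    return counts

print(count_space_runs("  AB   C D"))
# [2, 3, 1]
```

Each number is the length of one run of spaces, in the order the runs appear in the string.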


The above will count the spaces. To count the words (e.g., to get PIC X(6) VALUE "AMOUNT") you can simply do something like:

word_counts = [ len(word) for word in s.split() ]

where split (which you have used) returns a list of the words in the string, treating any run of whitespace as a single separator.
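
To get the spaces and the words in the right order for the PIC lines, one option is to let groupby split the line into alternating runs of spaces and non-spaces in a single pass. This is a rough sketch, not the only way to do it: the function name pic_lines is made up, the key function is my own addition (it is not in the one-liner above), and the output format just copies the example in the question.

```python
from itertools import groupby

def pic_lines(header):
    """Turn one header line into PIC X(n) clauses, one per run of characters."""
    # 'pic_lines' is a hypothetical helper; the clause format follows the
    # example output shown in the question.
    lines = []
    # The key groups characters by "is this a space?" instead of by the
    # character itself, so a whole word comes out as a single run.
    for is_space, run in groupby(header.rstrip("\n"), key=lambda ch: ch == ' '):
        text = ''.join(run)
        if is_space:
            lines.append("PIC X(%d) VALUE SPACES." % len(text))
        else:
            lines.append('PIC X(%d) VALUE "%s".' % (len(text), text))
    return lines

for line in pic_lines("  AMOUNT   DATE"):
    print(line)
# PIC X(2) VALUE SPACES.
# PIC X(6) VALUE "AMOUNT".
# PIC X(3) VALUE SPACES.
# PIC X(4) VALUE "DATE".
```

Note that trailing spaces on the line produce a final SPACES clause too, which may or may not be what you want for the report width.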

Upvotes: 3
