Paul
Paul

Reputation: 6176

How to understand this code, to split an array in Python?

I'm a bit confused about a line in Python:

We use Python and a custom function to split a line: we want what is between quotes to be a single entry in the array.

The line is, for example:

"La Jolla Bank, FSB",La Jolla,CA,32423,19-Feb-10,24-Feb-10

So "La Jolla Bank, FSB" should be a single entry in the array.

And I'm not sure to understand this code:

  1. The first char is a quote '"', so the variable "quote" is set to its inverse, so set to "TRUE".

  2. Then we check the comma, AND if quote is set to its inverse, so if quote is TRUE, which is the case when we are inside the quotes.

  3. We cut it with current="", and this is where I don't understand: we are still between the quotes, so normally we should not cut it now! edit: so and not quote means "false", and not "the opposite of", thanks !

Code:

def mysplit (string):
    quote = False
    retval = []
    current = ""
    for char in string:
        if char == '"':
            quote = not quote
        elif char == ',' and not quote: #the first coma is still in the quotes, and quote is set to TRUE, so we should not cut current here...
            retval.append(current) 
            current = "" 
        else:
            current += char
    retval.append(current)
    return retval

Upvotes: 0

Views: 174

Answers (6)

Sanjay Manohar
Sanjay Manohar

Reputation: 7026

OK you aren't quite there!

1.the first char is a quote ' " ', so the variable "quote" is set to its inverse, so set to "TRUE".

good! so quote was set to the inverse of whatever it was previously. At the beginning of the prog, it was false, so when " is seen, it becomes true. But vice versa, if it was True, and a quote is seen, it becomes false.

In other words, this line of the program changes quote from whatever is was before that line. It is called 'toggling'.

  1. then we check the coma, AND if quote is set to its inverse, so if quote is TRUE, which is the case when we are inside the quotes.

This isn't quite right. not quote means "only if quote is false". This has nothing to do with whether it is 'set to its inverse'. No variable can be equal to its own inverse! it is like saying X=True and X=False - obviously nonsense.

quote is always either True or False - and nothing else!

3.we cut it with current="", and this is where i don't understand : we are still between the quotes, so norm ally we should not cut it now!

So hopefully you can see now that, you are not between the quotes if you reach this line. the not quote ensures that you don't cut inside a quote, because not quote really means just that - not in a quote!

Upvotes: 1

Vlad
Vlad

Reputation: 9481

This is not the best code for splitting but it is pretty straight forward

   1 current = ""

   # First you set current to empty string, the following line
   # will loop through the string to be split and pull characters out of it
   # one by one... setting 'char' to be the value of next character

   2 for char in string:

   # the following code will check if the line we are currently inside of the quote
   # if otherwise it will add the current character to the the 'current' variable
   # 

   3     if char == '"':
   4         quote = not quote
   5     elif char == ',' and not quote:
   6         retval.append(current) 

   ### if we see the comma, it will append whatever is accumulated in current to the 
   ### return result.
   ### then you have to reset the value in the current to let the next word accumulate


   7         current = "" #why do we cut current here? 
   8     else:
   9         current += char

   ### after the last char is seen, we still have left over characters in current which
   ### we can just shove into the final result

   10 retval.append(current)
   11 return retval


   Here is an example run:

   Let string be  'a,bbb,ccc

   Step  char  current   retval

    1     a      a        {}
    2     ,               {a}       ### Current is reset
    3     b      b        {a}
    4     b      bb       {a} 
    5     b      bbb      {a}
    6     ,               {a,bbb}   ### Current is reset

   and so on

Upvotes: 1

rossipedia
rossipedia

Reputation: 59397

What you're looking at is a condensed version of a Finite State Machine, used by most language parsing programs.

Let's see if I can't annotate it:

def mysplit (string):
    # We start out at the beginning of the string NOT in between quotes
    quote = False
    # Hold each element that we split out
    retval = []
    # This variable holds  whatever the current item we're interested in is
    # e.g: If we're in a quote, then it's everything (including commas)
    # otherwise it's every UP UNTIL the next comma
    current = ""
    # Scan the string character by character
    for char in string:
        # We hit a quote, so turn on QUOTE SCANNING MODE!!!
        # If we're in quote scanning mode, turn it off
        if char == '"':
            quote = not quote
        # We hit a comma, and we're not in quote scanning mode
        elif char == ',' and not quote:
            # We got what we want, let's put it in the return value
            # and then reset our current item to nothing so we can prepare for the next item.
            retval.append(current) 
            current = "" 
        else:
            # Nothing special, let's just keep building up our current item
            current += char
    # We're done with all the characters, let's put together whatever we were working on when we ran out of characters
    retval.append(current)
    # Return it!
    return retval

Upvotes: 1

enticedwanderer
enticedwanderer

Reputation: 4346

current is reset to empty because in the case where you have encountered ',' and you are not under "" quotes you should interpret that as an end of a "token".

This is definitely not pythonic, for char in string makes me cringe and whoever wrote this code should have used regex.

Upvotes: 1

John Weldon
John Weldon

Reputation: 40789

You're viewing it as though both if char == '"' and elif char == ',' and not quote were run.

However the if statement explicitly makes it so that only one will run.

Either, quote will be inverted OR the current value will get cut.

In the case where the current char is ", then the logic will be called to invert the quote flag. But the logic to cut the string will not run.

In the case where the current char is ,, then the logic for inverting the flag will NOT run, but the logic to cut the string will if the quote flag is not set.

Upvotes: 3

retracile
retracile

Reputation: 12339

That is initializing current to the empty string, wiping out whatever it may have been set to before.

As long as you are not inside quotes (ie. quote is False), when you see a ,, you have hit the end of the field. Whatever you have accumulated into current is the content of that field, so append it to retval and reset current to the empty string, ready for the next field.

That said, this looks like you're dealing with a .csv input. There is a csv module that can deal with this for you.

Upvotes: 1

Related Questions