doxav

Reputation: 988

Cython string management and memory

I just started with Cython. I tried to port this Python function to Cython to improve its speed, but it is now slower. At first it was slightly faster, so I tried to improve string concatenation and memory allocation, since they account for 80% of the processing time, but then it got worse :). I have certainly chosen the slowest solution for string concatenation and memory allocation.

Should I use malloc? Are multiple strcat calls a very bad option in Cython, and what would be a better option?

PS: I didn't reuse my Python optimization (collecting the pieces in intermediate lists of strings and joining them into one string at the end) because I couldn't find the right way to do the join at the end in Cython.

New Cython code (UPDATED: replaced strcat by sprintf, track string size to avoid re-computing it):

import sys
from libc.stdio cimport sprintf
from libc.string cimport strcpy

#
# Input: takes an array of strings as input (typically extracted from a csv) with
# different options to help format the output.
#
# It then builds a string made of the different parts of this array. Those data are used for
# classification (target class, tag, different float/int data, different categorical data).
#
# Output: it returns a string compliant with the vowpal wabbit input format.
# 
# Sample input:
# array_to_vw(["No", "10990", "32", "64", "28", "dog", "food", "jedi"], False, 0, [], [2, 3, 4], 1, [0, 1], 
# re.compile('Y(es)?|T(rue)?|\+?1'), re.compile('No?|F(alse)?|0|-1'), re.compile('^(\-)?[0-9.]*$')) 
# output:
# "-1 '10990 |i f2:32 f3:64 f4:28 | dog food jedi"
# 
# note: the regexes are usually compiled once and stored, not recompiled on each call to this function
# 
def array_to_vw(data, train=False, int category_index=0, categorical_colnums=[], numerical_colnums=[], int tag_index=-1, skipped_idx=[0], cpregex=None, cnregex=None, cfloatregex=None):

  cdef char[20] category
  cdef char[1000] outline_array
  cdef char[500] categorical_array
  cdef char[20] col
  cdef char[20] tag

  cdef int outline_array_len = 0
  cdef int categorical_array_len = 0
  cdef int colnum = 0

  categorical_array[0] = 0
  strcpy(category, data[category_index])

  categorical_array_len = sprintf(categorical_array, "| ")

  if cpregex.match(category): # regex for positive category
    strcpy(category, '1')
  elif cnregex.match(category): # regex for negative category
    strcpy(category, '-1')
  elif train: # if no valid class but in train mode, set default positive value
    strcpy(category, '1')
  else:
    sys.exit("Category's regex did not match a record:\n" + category)
  # format the beginning of the string output (change if a tag is specified)
  if tag_index > -1:
    strcpy(tag, data[tag_index])
    outline_array_len = sprintf(outline_array, "%s '%s |i ", category, tag)
  else:
    outline_array_len = sprintf(outline_array, "%s |i ", category)

  for colnum in range(len(data)):
    if sprintf(col, data[colnum]) > 0 and colnum not in skipped_idx:
      if colnum in categorical_colnums:
        categorical_array_len += sprintf(categorical_array + categorical_array_len, "%s ", col)
      elif colnum in numerical_colnums:
        outline_array_len += sprintf(outline_array + outline_array_len, "f%d:%s ", colnum, col)
      else:
        if cfloatregex.match(data[colnum]): # If the feature is a number, then give it a label
          outline_array_len += sprintf(outline_array + outline_array_len, "f%d:%s ", colnum, col)
        else: # If the feature is a string, then let vw handle it directly
          categorical_array_len += sprintf(categorical_array + categorical_array_len, "%s ", col)

  if categorical_array_len > 2:
    sprintf(outline_array + outline_array_len, "%s\n", categorical_array)
  else:
    strcpy(outline_array + outline_array_len, "\n")

  #print outline_array
  return outline_array

Initial Python code:

  def array_to_vw(data, train=False, category_index=0, categorical_colnums=[], numerical_colnums=[], tag_index=-1, skipped_idx=[0], cpregex=None, cnregex=None, cfloatregex=None):
    # providing pre-compiled regexes to array_to_vw() really improves performance if array_to_vw() is called many times
    if cfloatregex is None: 
      cfloatregex = re.compile('^(\-)?([0-9.]+e(\+|-))?[0-9.]+$')
    if cpregex is None:
      cpregex = re.compile('Y(es)?|T(rue)?|\+?1')
    if cnregex is None:
      cnregex = re.compile('No?|F(alse)?|0|-1')

    category = data[category_index]
    #if re.search(pregex, category): # regex for positive category
    if cpregex.match(category): # regex for positive category
      category = '1'
    #elif re.search(nregex, category): # regex for negative category
    elif cnregex.match(category): # regex for negative category
      category = '-1'
    elif train: # if no valid class but in train mode, set default positive value
      category = '1'
    else:
      sys.exit("Regex did not match a record, exiting.\nPositive Regex: " + cpregex.pattern + "\nNegative Regex: " + cnregex.pattern + "\nRecord:\n" + str(data))

    if tag_index > -1:
      tag = data[tag_index]
      outline_array = [category, " '", tag, " |i "]
    else:
      outline_array = [category, " |i "]

    colnum = 0
    categorical_array = ['| ']
    for col in data:
      if col and colnum not in skipped_idx:
        if colnum in categorical_colnums:
          #outline_array.extend([col, ' '])
          categorical_array.extend([col, ' '])
        elif colnum in numerical_colnums:
          outline_array.extend(['f', str(colnum), ':', col, ' '])
          #outline_array = "%sf%s:%s " % (outline_array, str(colnum), col)
        else:
          colstr = str(colnum)
          if cfloatregex.match(col): # If the feature is a number, then give it a label
            #print "numerical col:", colstr
            outline_array.extend(['f', colstr, ':', col, ' '])
          else: # If the feature is a string, then let vw handle it directly
            #print "categorical col:", colstr
            categorical_array.extend([col, ' '])
            # once a column is detected as a string/categorical it shouldn't be processed differently later on
            categorical_colnums.append(colnum)

      #colnum = colnum + 1
      colnum += 1

    #if len(categorical_array) > 1:
    #  outline_array.extend(categorical_array)
    outline_array.extend(categorical_array)
    outline_array.extend("\n")

    return "".join(outline_array)

Upvotes: 2

Views: 785

Answers (1)

John Zwinck

Reputation: 249133

You can speed this up massively if you vectorize your operations using NumPy. For example, here is how you can take the input data list/array and determine which indexes are numeric, without using a (relatively slow) regex:

>>> import numpy as np
>>> data = ["No", "10990", "32", "64", "28", "dog", "food", "jeddy"]

>>> np.genfromtxt(np.array(data))
array([nan, 10990., 32., 64., 28., nan, nan, nan])

>>> np.isnan(np.genfromtxt(np.array(data)))
array([True, False, False, False, False, True, True, True], dtype=bool)

Hopefully this gives you a taste of what's possible. At the very least this will completely eliminate the regex matching, which is probably one of the slower parts of your code now. And it doesn't require Cython.
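To illustrate the "no regex" idea end to end, here is a minimal sketch in plain Python. It swaps `np.genfromtxt` for a `float()` attempt per column, so it is a different technique from the NumPy call above, and it builds the line with a single join at the end. The helper names `is_number` and `row_to_vw`, the fixed `category` default and the `skipped_idx` default are assumptions made for the example, not part of your code. Note that `float()` also accepts strings such as "1e5", "nan" and "inf", so it is not an exact replacement for your regex.

```python
def is_number(text):
    """Return True if text parses as a float (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def row_to_vw(data, category="-1", skipped_idx=(0,)):
    """Build one Vowpal Wabbit style line from a list of strings (sketch only)."""
    numeric_parts = []      # pieces such as "f2:32"
    categorical_parts = []  # raw string features
    for colnum, col in enumerate(data):
        # Skip empty columns and columns the caller asked to ignore.
        if not col or colnum in skipped_idx:
            continue
        if is_number(col):
            numeric_parts.append("f%d:%s" % (colnum, col))
        else:
            categorical_parts.append(col)
    # Join each group once at the end instead of concatenating repeatedly.
    return "%s |i %s | %s\n" % (
        category,
        " ".join(numeric_parts),
        " ".join(categorical_parts),
    )


data = ["No", "10990", "32", "64", "28", "dog", "food", "jeddy"]
line = row_to_vw(data)
```

With the sample row, every column except index 0 is classified by the `float()` attempt alone. Whether this is faster than your regex version is something you would need to measure on your own data.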

Upvotes: 2
