David D

Reputation: 1545

Can't understand a line in a "Collective intelligence" program

I'm working through "Programming Collective Intelligence". In chapter 4, Toby Segaran builds an artificial neural network. The following function appears in the book:

def generatehiddennode(self,wordids,urls):
  if len(wordids)>3: return None
  # Check if we already created a node for this set of words
  sorted_words=[str(id) for id in wordids]
  sorted_words.sort()
  createkey='_'.join(sorted_words)
  res=self.con.execute(
  "select rowid from hiddennode where create_key='%s'" % createkey).fetchone()

  # If not, create it
  if res==None:
    cur=self.con.execute(
    "insert into hiddennode (create_key) values ('%s')" % createkey)
    hiddenid=cur.lastrowid
    # Put in some default weights
    for wordid in wordids:
      self.setstrength(wordid,hiddenid,0,1.0/len(wordids))
    for urlid in urls:
      self.setstrength(hiddenid,urlid,1,0.1)
    self.con.commit()

What I can't understand is the reason for the first line in this function: `if len(wordids)>3: return None`. Is it debug code that needs to be removed later?

P.S. This is not homework.

Upvotes: 4

Views: 426

Answers (3)

Gareth Rees

Reputation: 65854

For a published book, that's pretty terrible code! (You can download all the examples for the book from here; the relevant file is chapter4/nn.py.)

  • No docstring. What is this function supposed to do? From its name, we can guess that it's generating one of the nodes in the "hidden layer" of a neural network, but what role do the wordids and urls play?
  • Database query uses string substitution and so is vulnerable to SQL injection attacks (especially since this is something to do with web searching, so the wordids probably come from a user query and so may be untrusted—but then, maybe they are ids rather than words so it's OK in practice but still a very bad habit to get into).
  • Not using the expressive power of the database: if all you want to do is to determine if a key exists in the database then you probably want to use a SELECT EXISTS(...) rather than asking the database to send you a bunch of records which you're then going to ignore.
  • Function does nothing if there was already a record with createkey. No error. Is that correct? Who can say?
  • The weighting for the words is scaled to the number of words, but the weighting for the urls is the constant 0.1 (perhaps there are always 10 URLs, but it would be better style to scale by len(urls) here).

I could go on and on, but I'd better not.
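The two database points in the list above can be sketched as follows. This is a minimal illustration using Python's sqlite3 module with an in-memory database; the helper names hidden_node_exists and insert_hidden_node and the one-column table definition are assumptions made for the example, not code from the book.

```python
import sqlite3


def hidden_node_exists(con, create_key):
    """Return True if a row with this create_key is already stored."""
    # The ? placeholder lets the driver bind the value, so the key is never
    # spliced into the SQL text. EXISTS returns a single 0/1 value.
    row = con.execute(
        "SELECT EXISTS(SELECT 1 FROM hiddennode WHERE create_key = ?)",
        (create_key,),
    ).fetchone()
    return bool(row[0])


def insert_hidden_node(con, create_key):
    """Insert a hidden node row and return its rowid."""
    cur = con.execute(
        "INSERT INTO hiddennode (create_key) VALUES (?)", (create_key,)
    )
    return cur.lastrowid


# Assumed schema for the sketch: a single text column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hiddennode (create_key TEXT)")

if not hidden_node_exists(con, "1_2_3"):
    insert_hidden_node(con, "1_2_3")
con.commit()
```

With bound parameters, a key containing quote characters is treated as an ordinary string value rather than as part of the statement.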

Anyway, to answer your question, it looks as though this function is adding a database entry for a node in the hidden layer of a neural network. This neural network has, I think, words in the input layer, and URLs in the output layer. The idea of the application is to attempt to train a neural network to find good search results (URLs) based on the words in the query. See the function trainquery, which takes the arguments (wordids, urlids, selectedurl). Presumably (since there's no docstring I have to guess) wordids were the words the user searched for, urlids are the URLs the search engine offered the user, and selectedurl is the one the user picked. The idea being to train the neural net to better predict which URLs users will pick, and so place those URLs higher in future search results.

So the mysterious line of code is preventing nodes being created in the hidden layer with links to more than three nodes in the input layer. In the context of the search application this makes sense: there's no point in training up the network on queries that are too specialized, because these queries won't recur often enough for the training to be worth it.

Upvotes: 6

gotgenes

Reputation: 40029

You probably should have posted a little more context for the code. Here is the paragraph in Programming Collective Intelligence which immediately precedes that code:

This function will create a new node in the hidden layer every time it is passed a combination of words that it has never seen together before. The function then creates default-weighted links between the words and the hidden node, and between the query node and the URL results returned by this query.

I realize it still doesn't help answer your question, but it would have helped Gareth Rees out with his answer by giving less guesswork. Gareth still got it correct, anyway, since he's clever. The intention is to restrict the number of word nodes a hidden node can be associated with, and the author chose the arbitrary number of 3.

Just to agree with Gareth, again, that paragraph should have totally been in the docstring, and the purpose of the line in question should have been in a comment above the line. I hope the next edition isn't so sloppy.
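As a rough sketch of what that documentation could look like, here is the same logic with a docstring and a comment on the guard. To keep the example self-contained, the database is replaced by two dictionaries; the class name HiddenLayerSketch, the MAX_WORDS constant, and the choice to return the new node id are assumptions for illustration and are not in the book's code.

```python
class HiddenLayerSketch:
    """In-memory stand-in for the book's database-backed network."""

    MAX_WORDS = 3  # the cutoff hard-coded as 3 in the original function

    def __init__(self):
        self.hidden = {}    # create_key -> hidden node id
        self.strength = {}  # (from_id, to_id, layer) -> weight

    def generatehiddennode(self, wordids, urls):
        """Create a hidden node for a word combination not seen before.

        Links each word to the new node with weight 1/len(wordids) and
        links the node to each URL with weight 0.1. Returns the new node
        id, or None if no node was created.
        """
        # Skip long queries: a hidden node is linked to at most
        # MAX_WORDS word nodes.
        if len(wordids) > self.MAX_WORDS:
            return None
        # Sort the ids as strings so word order does not change the key.
        createkey = "_".join(sorted(str(w) for w in wordids))
        if createkey in self.hidden:
            return None
        hiddenid = len(self.hidden) + 1
        self.hidden[createkey] = hiddenid
        # Default weights, as in the original function.
        for wordid in wordids:
            self.strength[(wordid, hiddenid, 0)] = 1.0 / len(wordids)
        for urlid in urls:
            self.strength[(hiddenid, urlid, 1)] = 0.1
        return hiddenid
```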

Upvotes: 1

learnvst

Reputation: 16195

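To elaborate on the above answers, look at this simple script:

def doSomething(wordids):
  # Return early when more than 3 ids are passed in
  if len(wordids)>3: return None
  print("The rest of the function executes")


# Three ids: the guard is not triggered, so the message is printed
blah = [2,3,4]
doSomething(blah)

# Four ids: the function returns at the guard and prints nothing
blah = [2,3,4,5]
doSomething(blah)

So if wordids has more than 3 elements, the function returns immediately and does nothing else. It is common to check a function's inputs this way, although in more advanced code invalid input is normally reported by raising an exception.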
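For comparison, an exception-based version of the same check might look like the sketch below. The function name do_something_strict and the max_words parameter are made up for this example. Note that this changes the behaviour: the caller is told about the oversized input instead of the call silently doing nothing, which may or may not be what the book's code wants.

```python
def do_something_strict(wordids, max_words=3):
    """Reject oversized input with an exception instead of returning None."""
    if len(wordids) > max_words:
        raise ValueError(
            f"expected at most {max_words} word ids, got {len(wordids)}"
        )
    return "The rest of the function executes"


print(do_something_strict([2, 3, 4]))

try:
    do_something_strict([2, 3, 4, 5])
except ValueError as exc:
    # The caller decides what to do with the rejected input.
    print("rejected:", exc)
```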

Upvotes: 0
