user2027080
user2027080

Reputation: 125

Calling a Snowball/Porter2 Stemmer from T-SQL

I'm trying to come up with an easy way for analysts at my office to invoke a stemmer from MSSQL. It would be used to generate stemmed notes fields for two purposes:

  1. to create training sets where the most common stemmed notes fields are counted and aggregated, then classified starting with the most frequently appearing first. (For our application we're going to have a lot of commonly repeating phrases and a lot of one off stuff).

  2. to generate derived stemmed versions of the notes fields to run classification models on.

I'm thinking it would be easy for everyone if I were able to create a SQL function called maybe StemText that took a text field and returned the stemmed version. Then I could create another function which invoked a specified classification model name in R (in this case on the stemmed notes field).

We currently have a bunch of stemmed training set data (and successful classification systems) built around the Snowball/Porter2 stemmer. I saw that there was a T-SQL port of the Porter1 stemmer, but not Porter2. In any case, I see that there are a number of packages out there which include the Snowball algorithm like SOLR/Lucene and so forth, but I'm not seeing much info on how to invoke them from T-SQL. I also see that there is a snowball add on package for R and there is a lot more info out there for interfacing with R from MSSQL.

Does anyone have any thoughts or advice on this approach?

Upvotes: 0

Views: 212

Answers (1)

user2027080
user2027080

Reputation: 125

I ended up creating a CLR function using a Visual Studio SQL Project that wrapped and augmented one of the official CSharp ports of the Porter2 stemmer. I runs quite fast and was pretty easy to setup.

Upvotes: 1

Related Questions