karianneberg

Reputation: 328

How can I use machine learning to extract larger chunks of text from a document?

I am currently learning about machine learning, as I think it might be helpful to solve a problem I have. However, I am unsure about what techniques I should apply to solve my problem. I apologise in advance for probably not knowing enough about this field to even ask a proper question.

What I want to do is extract the significant parts of a knitting pattern (the actual pattern, not all the intro and stuff like that). For instance, I would like to feed this web page into my program and get out something like this:

{
    title: "Boot Style Red and White Baby Booties for Cold Weather"
    directions: "
    Right Bootie.
    Cast on (31, 43) with white color.
    Rows (1, 3, 5, 7, 9, 10, 11): K.
    Row 2: K1, M1, (K14, K20), M1, K1, M1, (K14, K20), M1, K1. (35, 47 sts)
    Row 4: K2, M1, (K14, K20), M1, K3, M1, (K14, K20), M1, K2. (39, 51 sts)
    Row 6: K3, M1, (K14, K20), M1, K5, M1, (K14, K20), M1, K3. (43, 55 sts)
    ..."
}

I've been reading about extracting smaller parts, like sentences and words, and also about stuff like Named Entity Recognition, but they all seem to be focused on very small parts of the text.

My current thinking is to use supervised learning, but I'm very unsure about how to extract features from the text. Naive methods like using letters, words or even sentences as features seem unlikely to be relevant enough to yield satisfactory results (and there would also be tons of features, unless I use some kind of sampling). So what are the significant features for finding out which parts are what in a knitting pattern?

Can someone point me in the right direction of algorithms and methods to do extraction of larger portions of the text?

Upvotes: 1

Views: 1070

Answers (1)

yvespeirsman

Reputation: 3099

One way to see this is as a straightforward classification problem: for each sentence in the page, you want to determine if it's relevant to you or not. Optionally, you have different classes of relevant sentences, such as "title" and "directions".
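To make the framing concrete, here is a rough sketch of the surrounding pipeline: split the page text into units, label each unit, and assemble the units by class. It assumes each non-empty line counts as one "sentence", which is a simplification, and `toy_classify` is a hypothetical hand-written stand-in for the trained classifier, not a recommended approach.

```python
import re

def split_units(page_text):
    """Treat every non-empty line as one unit to classify (an assumption)."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]

def toy_classify(unit):
    """Hypothetical placeholder rules; a trained classifier would go here."""
    if unit.startswith(("Row", "Cast on")) or re.search(r"\b[KM]\d+\b", unit):
        return "directions"
    if "Booties" in unit:
        return "title"
    return "other"

def assemble(units, classify):
    """Group classified units into the title/directions structure."""
    title = ""
    directions = []
    for unit in units:
        label = classify(unit)
        if label == "title" and not title:
            title = unit  # keep the first unit labelled as a title
        elif label == "directions":
            directions.append(unit)
    return {"title": title, "directions": "\n".join(directions)}

page = """Boot Style Red and White Baby Booties for Cold Weather

Thanks for visiting my blog!
Cast on (31, 43) with white color.
Rows (1, 3, 5, 7, 9, 10, 11): K.
"""
result = assemble(split_units(page), toy_classify)
```

The point is only that extracting a "larger chunk" reduces to many small per-sentence decisions whose results are glued back together.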

As a result, for each sentence you need to extract the features that contain information about its status. This will likely involve tokenizing the sentence, and possibly applying some type of normalization. Initially I would focus on features such as individual words (M1, K1, etc.) or n-grams (a number of adjacent words). Yes, there are many of them, but a good classifier will learn which features are informative, and which are not. If you're really worried about data sparseness, you can also reduce the number of features by mapping similar "words" such as M1 and K1 to the same feature.
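A minimal sketch of that feature extraction, using only the standard library. The normalization rules are my own assumptions for illustration: one- or two-letter stitch codes followed by digits (K1, M1, K14) are mapped to a single `STITCH` feature and bare numbers to `NUM`; you may well want finer-grained classes.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def normalize(token):
    """Map similar tokens to one feature to reduce sparseness (assumed rules)."""
    if re.fullmatch(r"[a-z]{1,2}\d+", token):
        return "STITCH"  # e.g. k1, m1, k14
    if token.isdigit():
        return "NUM"     # e.g. row numbers, stitch counts
    return token

def features(sentence, n=2):
    """Count unigram and n-gram features over the normalized tokens."""
    tokens = [normalize(t) for t in tokenize(sentence)]
    counts = Counter(tokens)
    for i in range(len(tokens) - n + 1):
        counts[" ".join(tokens[i:i + n])] += 1
    return counts

example = features("Row 2: K1, M1, K1.")
```

For that example sentence, the counts include `STITCH` three times and the bigram `STITCH STITCH` twice, which is the kind of signal that separates pattern rows from ordinary prose.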

Additionally, you will need to label a set of example sentences, to serve as the training and test sets for your classifier. This will allow you to train the system, evaluate its performance and compare different approaches.
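As a sketch of what the labelled data and the evaluation could look like: the pattern lines below come from your example, while the "other" sentences are invented placeholders, and the label names are just one possible choice. A real set would need far more examples than this.

```python
import random

# (sentence, label) pairs; the "other" sentences are made up for illustration.
labelled = [
    ("Boot Style Red and White Baby Booties for Cold Weather", "title"),
    ("Cast on (31, 43) with white color.", "directions"),
    ("Rows (1, 3, 5, 7, 9, 10, 11): K.", "directions"),
    ("Row 2: K1, M1, (K14, K20), M1, K1, M1, (K14, K20), M1, K1.", "directions"),
    ("Row 4: K2, M1, (K14, K20), M1, K3, M1, (K14, K20), M1, K2.", "directions"),
    ("I made these for my niece last winter.", "other"),
    ("Leave a comment if you have questions.", "other"),
    ("Click here to see more free patterns.", "other"),
]

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and cut it into train and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, test_set):
    """Fraction of test sentences for which predict(sentence) is correct."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)

train, test = train_test_split(labelled)
# Baseline to beat: always answer with the most common label.
baseline = accuracy(lambda text: "directions", labelled)
```

Comparing every classifier against a trivial baseline like this one tells you whether your features are adding anything at all.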

To start, you can experiment with some simple, but popular classification methods such as Naive Bayes.
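To show how little is involved, here is a small multinomial Naive Bayes with add-one smoothing, written from scratch so that it has no dependencies. It is a sketch under simplifying assumptions (whitespace-split tokens, a tiny training set with invented "other" sentences); in practice an existing library implementation is the more sensible choice.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = (sum(self.feature_counts[label].values())
                     + self.alpha * len(self.vocab))
            for token in tokens:
                if token in self.vocab:  # ignore tokens never seen in training
                    seen = self.feature_counts[label][token]
                    score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

training = [
    ("row 2 k1 m1 k14 m1 k1".split(), "directions"),
    ("cast on 31 with white color".split(), "directions"),
    ("rows 1 3 5 k".split(), "directions"),
    ("i love knitting for my family".split(), "other"),  # invented
    ("leave a comment below".split(), "other"),          # invented
]
model = NaiveBayes().fit(training)
pattern_label = model.predict("row 4 k2 m1 k14 m1 k3".split())
prose_label = model.predict("please leave a comment".split())
```

With this toy data the unseen pattern row is labelled "directions" and the prose sentence "other", even though several of their tokens never appeared in training.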

Upvotes: 1
