How can I use machine learning to extract larger chunks of text from a document?

Question

I am currently learning about machine learning because I think it might help me solve a problem I have. However, I am unsure which techniques I should apply. I apologise in advance for probably not knowing enough about this field to even ask a proper question.

What I want to do is extract the significant parts of a knitting pattern (the actual pattern, not the intro and other surrounding text). For instance, I would like to feed this web page into my program and get out something like this:

{
    title: "Boot Style Red and White Baby Booties for Cold Weather"
    directions: "
    Right Bootie.
    Cast on (31, 43) with white color.
    Rows (1, 3, 5, 7, 9, 10, 11): K.
    Row 2: K1, M1, (K14, K20), M1, K1, M1, (K14, K20), M1, K1. (35, 47 sts)
    Row 4: K2, M1, (K14, K20), M1, K3, M1, (K14, K20), M1, K2. (39, 51 sts)
    Row 6: K3, M1, (K14, K20), M1, K5, M1, (K14, K20), M1, K3. (43, 55 sts)
    ..."
}

I've been reading about extracting smaller parts, like sentences and words, and also about stuff like Named Entity Recognition, but they all seem to be focused on very small parts of the text.
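To make the gap concrete, here is a rough sketch of one way I imagine small-unit labels could be turned into larger chunks: label each line, then merge consecutive lines that share a label. This is only my assumption about how the pieces might fit together; the function name merge_labeled_lines and the label names are made up for illustration, and the labels would have to come from some classifier I don't have yet.

```python
from itertools import groupby


def merge_labeled_lines(lines, labels):
    """Group consecutive lines that share a label into (label, text) chunks.

    `labels` is assumed to hold one hypothetical label per line,
    e.g. "directions" or "other", produced by some earlier step.
    """
    chunks = []
    pairs = zip(labels, lines)
    # groupby only merges *adjacent* items with the same key,
    # which is exactly what is needed for contiguous chunks.
    for label, group in groupby(pairs, key=lambda pair: pair[0]):
        text = "\n".join(line for _, line in group)
        chunks.append((label, text))
    return chunks


if __name__ == "__main__":
    lines = ["Hi there", "Cast on 31", "Row 1: K.", "Thanks"]
    labels = ["other", "directions", "directions", "other"]
    for label, text in merge_labeled_lines(lines, labels):
        print(label, repr(text))
```

With per-line labels like these, the "directions" field in my desired output would simply be the text of the merged "directions" chunk.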

My current thought is to use supervised learning, but I'm very unsure how to extract features from the text. Naive methods like using letters, words or even sentences as features seem like they wouldn't be relevant enough to yield satisfactory results (and there would also be tons of features, unless I use some kind of sampling). So what are the really significant features for finding out which parts are what in a knitting pattern?
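To show the kind of feature I mean by "more than just words", here is a minimal sketch of hand-made features computed per line. Everything in it is my own guess: the abbreviation list in KNIT_TOKEN is incomplete and hypothetical, and the feature names are invented for illustration. I don't know whether features like these are actually the significant ones.

```python
import re

# Hypothetical, incomplete list of knitting abbreviations (K14, M1, P2, ...).
KNIT_TOKEN = re.compile(r"\b(?:K2tog|SSK|YO|M1|K|P)\d*\b")
# Matches stitch counts such as "(35, 47 sts)".
STITCH_COUNT = re.compile(r"\(\s*\d+(?:\s*,\s*\d+)*\s*sts\)")


def line_features(line: str) -> dict:
    """Turn one line of text into a small dict of numeric/boolean features."""
    n_tokens = max(len(line.split()), 1)  # avoid division by zero
    n_chars = max(len(line), 1)
    return {
        "starts_with_row": bool(re.match(r"\s*Rows?\b", line)),
        "knit_token_ratio": len(KNIT_TOKEN.findall(line)) / n_tokens,
        "digit_ratio": sum(c.isdigit() for c in line) / n_chars,
        "has_stitch_count": bool(STITCH_COUNT.search(line)),
    }


if __name__ == "__main__":
    pattern_line = "Row 2: K1, M1, (K14, K20), M1, K1, M1, (K14, K20), M1, K1. (35, 47 sts)"
    intro_line = "This pattern is great for cold weather."
    print(line_features(pattern_line))
    print(line_features(intro_line))
```

Each dict could then be fed to a supervised classifier as one training example per line, which keeps the number of features small compared with using every word as a feature.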

Can someone point me in the right direction regarding algorithms and methods for extracting larger portions of text?

