Reputation: 61

How to get meaningful words by splitting a continuous string?

I am trying to parse a particular HTML response. I have successfully extracted the text from the page as one continuous string.

For example:

The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters and their names wereElsie LacieandTillie \nand they lived at the bottom of a well Blockquote

My first question: I need to split the string into individual words. For example,

storyOnce

should be converted to a list of meaningful words...

[The,....,story,Once,....]

I also need to get rid of the "\n" characters. I tried using

.strip

but it doesn't seem to work. I think I may be using it the wrong way. I am a newbie, so please elaborate in your answers. That would be helpful.

Upvotes: 1

Answers (4)

Saher Ahwal

Reputation: 9237

For removing the \n characters, strip will only work if they are at the beginning or end of the string.

You can use split instead and join the pieces back together without the \n, if you end up splitting on \n.
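
As a rough sketch of the difference, using a shortened piece of the string from the question (the variable names are just for illustration):

```python
text = "wereElsie LacieandTillie \nand they lived at the bottom of a well"

# strip() only trims whitespace at the two ends, so the inner "\n" survives.
stripped = text.strip()

# split() with no argument splits on any run of whitespace (spaces, "\n", tabs),
# so joining the pieces with single spaces drops the newline.
cleaned = " ".join(text.split())

print(repr(stripped))
print(repr(cleaned))
```

If you only care about the newline and want to leave other spacing alone, text.replace("\n", " ") is another option.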

For your initial problem, since the text is exactly as you extracted it, what I would do is split on spaces first:

string.split(' ')

which will give something like

[The, Dormouse's, storyThe, Dormouse's, storyOnce, upon, a, time,...]

and then you can use some simple dictionary mapping with a smart algorithm as follows:

Iterate over the resulting list:

Use a dictionary or some NLP library to check for matches (e.g. 'story' matches the start of 'storyThe', so it should be split; you can also check that the remainder, 'The', exists in the dictionary too).
Try to smartly ignore names, which will not be in the dictionary. Some NLP libraries can help with that.
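
The steps above could be sketched roughly like this. The word set WORDS and the helper names segment and split_tokens are made up for illustration; a real word list would come from a dictionary file or an NLP library, and this is only one possible way to do the matching:

```python
# Hypothetical, tiny word list used only for this illustration.
WORDS = {"the", "story", "once", "were", "and"}

def segment(token, words=WORDS):
    """Return a list of dictionary words that exactly covers `token`, or None."""
    n = len(token)
    lower = token.lower()
    # best[i] holds one split of token[:i] into dictionary words, or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is not None and lower[start:end] in words:
                best[end] = best[start] + [token[start:end]]
                break
    return best[n]

def split_tokens(text):
    """Split on whitespace first, then try to break up each token."""
    result = []
    for token in text.split():
        pieces = segment(token)
        # Tokens that cannot be fully matched (names, "Dormouse's") stay whole.
        result.extend(pieces if pieces else [token])
    return result

print(split_tokens("The Dormouse's storyThe Dormouse's storyOnce upon"))
```

Unknown tokens are kept as they are, which is the crude version of "ignore names"; with a realistic dictionary you would also need a way to choose between several valid splits.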

This is a text segmentation problem, so you need to use some form of natural language processing to do the tokenization and text extraction.

@WannaBeCoder below suggests the NLTK platform and its book: http://www.nltk.org/book/

Have fun, this is challenging and cool!

Upvotes: 4

kilojoules

Reputation: 10083

import re
text = "The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters and their names wereElsie LacieandTillie \nand they lived at the bottom of a well Blockquote"
# Each match starts at a capital letter and runs up to the next capital,
# so the string is cut in front of every uppercase letter.
ans = ""
for a in re.findall('[A-Z][^A-Z]*', text):
    ans += a.strip() + ' '  # trim each piece, then rejoin with single spaces

ans
"The Dormouse's story The Dormouse's story Once upon a time there were three little sisters and their names were Elsie Lacieand Tillie \nand they lived at the bottom of a well Blockquote "

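A variant of the same capital-letter idea, as a sketch only: insert a space wherever a lowercase letter is directly followed by an uppercase one, then split on whitespace so the "\n" goes away too. The string is a slightly shortened copy of the one in the question.

```python
import re

text = ("The Dormouse's storyThe Dormouse's storyOnce upon a time there were "
        "three little sisters and their names wereElsie LacieandTillie \n"
        "and they lived at the bottom of a well")

# Zero-width match: the position between a lowercase and an uppercase letter.
spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

# split() without arguments splits on any whitespace, including "\n".
words = spaced.split()
print(words)
```

Words glued together in lowercase, such as 'Lacieand', are still not separated by this; that part needs a dictionary.
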
Upvotes: 0

WannaBeCoder

Reputation: 1282

You probably want text segmentation. From an old link I bookmarked, this seems to do the task for you. You could also use NLTK segmentation.

Upvotes: 3

AmeyA

Reputation: 15

I am creating a similar program. I created a word list from the sentence using .split() and compared it against a dictionary. For each unknown word, I used a binary map to create all possible combinations of chunks. From those combinations I separated out the unique chunks and compared them against the dictionary. Now I have every possible chunk combination of the unknown word, along with the parts of the word that are in the dictionary. I compared the two for every possible chunk combination, so that I get the combination with the least possible (number of chunks - number of chunks in that combination that are dictionary words).

But my method is time-consuming, and it has problems with ambiguous lines like 'loveisnowhere'.
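
For anyone curious, here is a minimal sketch of how such a binary map could look. It is a guess at the general idea, not the actual program; the word set WORDS and the function names are invented for the example:

```python
from itertools import product

# Hypothetical mini dictionary, chosen to expose the ambiguity.
WORDS = {"love", "is", "now", "here", "no", "where"}

def all_splits(word):
    """Yield every way to cut `word` into contiguous chunks."""
    n = len(word)
    # One bit per gap between letters: 1 means "cut here".
    # That gives 2 ** (n - 1) combinations, which is why this gets slow.
    for mask in product((0, 1), repeat=max(n - 1, 0)):
        chunks, start = [], 0
        for gap, cut in enumerate(mask, start=1):
            if cut:
                chunks.append(word[start:gap])
                start = gap
        chunks.append(word[start:])
        yield chunks

def best_splits(word, words=WORDS):
    """Return every split with the fewest chunks missing from the dictionary."""
    scored = [(sum(c not in words for c in chunks), chunks)
              for chunks in all_splits(word)]
    lowest = min(score for score, _ in scored)
    return [chunks for score, chunks in scored if score == lowest]

print(best_splits("loveisnowhere"))
```

With this toy dictionary both "love is now here" and "love is no where" score equally well, so the dictionary check alone cannot pick one.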

Upvotes: 1
