Reputation: 61
I am trying to parse a particular HTML response. I have successfully extracted the text from the page as one continuous string.
For example:
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters and their names wereElsie LacieandTillie \nand they lived at the bottom of a well Blockquote
My first question: I need to split the string into individual words. For example,
storyOnce
should be converted to a list of meaningful words:
[The, ..., story, Once, ...]
I also need to get rid of the "\n" characters. I tried using .strip, but it doesn't seem to work. I think I may be using it the wrong way. I am a newbie, so please elaborate in your answers. That would be helpful.
Upvotes: 1
Views: 3445
Reputation: 9237
For removing the \n chars: strip will only work if they are at the beginning or end of the string. You can use split instead and join the pieces back together without the \n, if you end up splitting on \n.
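To make the difference concrete, here is a minimal sketch. The sample string is just a shortened piece of the text from the question, and the variable names are mine. It relies on the fact that split() with no argument splits on any run of whitespace, newlines included.

```python
# Shortened sample from the question, with a newline in the middle.
text = "LacieandTillie \nand they lived at the bottom of a well"

# strip() only trims whitespace at the two ends, so the inner "\n" survives.
stripped = text.strip()

# split() with no argument splits on any run of whitespace, including "\n".
words = text.split()

# Join the pieces back with single spaces to get a newline-free string.
cleaned = " ".join(words)

print(repr(stripped))
print(words)
print(cleaned)
```

Note that strip, like all string methods, returns a new string, so the result has to be assigned to something; calling it without using the return value changes nothing.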
For your initial problem, since the text is exactly as you extracted it, what I would do is split on spaces first:
string.split(' ')
which will give something like
[The, Dormouse's, storyThe, Dormouse's, storyOnce, upon, a, time,...]
Then you can iterate over the resulting list and use some simple dictionary mapping with a smart algorithm.
This is a text segmentation problem, so you need some form of natural language processing to do the tokenization and text extraction.
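As a rough illustration of the dictionary idea (not a full NLP solution), here is a sketch that tries to break each fused token into the fewest known words. The segment function and the tiny VOCABULARY set are hypothetical and exist only for this example; a real word list would be much larger.

```python
def segment(token, vocabulary):
    """Split token into the fewest vocabulary words, or return None."""
    # best[i] holds the shortest word list covering token[:i], or None.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece.lower() in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


# Hypothetical mini dictionary, just enough for the sample tokens.
VOCABULARY = {"the", "story", "once", "were", "elsie", "lacie", "and", "tillie"}

tokens = "storyThe storyOnce wereElsie LacieandTillie upon".split()
result = []
for token in tokens:
    pieces = segment(token, VOCABULARY)
    # Keep tokens that cannot be segmented exactly as they were.
    result.extend(pieces if pieces is not None else [token])

print(result)
```

Preferring the fewest words is only one possible tie-breaking rule. Tokens that are not in the word list at all (here "upon") are passed through unchanged.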
@WannaBeCoder below suggests the NLTK platform and book: http://www.nltk.org/book/
Have fun, this is challenging and cool!
Upvotes: 4
Reputation: 10083
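import re
ans = ""
# Each match starts at a capital letter and runs up to the next capital.
for a in re.findall('[A-Z][^A-Z]*',"The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters and their names wereElsie LacieandTillie \nand they lived at the bottom of a well Blockquote"):
    # strip() trims each chunk's ends only; a "\n" inside a chunk is kept.
    ans+=a.strip()+' '
ans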
"The Dormouse's story The Dormouse's story Once upon a time there were three little sisters and their names were Elsie Lacieand Tillie \nand they lived at the bottom of a well Blockquote "
Upvotes: 0
Reputation: 1282
You probably want text segmentation. From an old link I bookmarked, this seems to do the task for you. You could also use NLTK segmentation.
Upvotes: 3
Reputation: 15
I am creating a similar program. I created a word list from the sentence using .split() and compared it to a dictionary. Then, for the unknown words, I used a binary map to create all possible combinations of chunks. From those combinations I separated the unique chunks and compared them to the dictionary. Now I have every possible chunk combination of the unknown word, plus the parts of the word that are in the dictionary. I compared both for every possible chunk combination of the unknown word, so that I get the smallest possible value of (number of chunks - number of words in that chunk combination that are in the dictionary).
But my method is time-consuming, and it has problems with ambiguous lines like 'loveisnowhere'.
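To show why 'loveisnowhere' is ambiguous, here is a small sketch that lists every way to split a string into dictionary words. It is a plain recursive enumeration, not the binary-map method described above, and the all_segmentations function and the VOCABULARY set are hypothetical names made up for the example.

```python
def all_segmentations(text, vocabulary):
    """Return every way to split text into vocabulary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            # Extend each segmentation of the remainder with this head word.
            for tail in all_segmentations(text[end:], vocabulary):
                results.append([head] + tail)
    return results


# Hypothetical mini dictionary covering the ambiguous example.
VOCABULARY = {"love", "is", "no", "now", "where", "here", "nowhere"}

options = all_segmentations("loveisnowhere", VOCABULARY)
for words in options:
    print(" ".join(words))
```

With this word list there are three valid readings, so a dictionary alone cannot decide which one was meant; some extra signal such as word frequencies would be needed. The enumeration can also grow very quickly on long inputs, which fits the time problem mentioned above.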
Upvotes: 1