Reputation: 41
How can I extract words from a string when the words are separated by punctuation, whitespace, digits, etc., without using split, replace, or a library like re? I'm still learning Python, and the book recommends finding solutions without resorting to list and string methods.
Example Input : The@Tt11end
Example Output: ["The", "Tt", "end"]
This is my attempt so far:
    def extract_words(sentence):
        words_list = []
        separator = [",",".",";","'","?","/","<",">","@","!","#","$","%","^","&","*","(",")","-","_","1","2","3","4","5","6","7","8","9"]
        counter= 0
        for i in range(len(sentence)):
            i=counter
            while(is_letter(sentence[i])):
                words+= sentence[i]
                i = i+1
                counter=counter+1
            words_list.append(words)
            words=""
        return words_list
My logic is to read the string until a non-alphabetic character is reached, append what was read to a words list, and then continue through the string from where I left off.
Nevertheless, the output is wrong:
['The', '', '', '', '', '', '', '', '', '', '']
Edit: this is my is_letter() method:
    def is_letter(char):
        return ("A" <= char and char <= "Z") or \
               ("a" <= char and char <= "z")
Upvotes: 0
Views: 1657
Reputation: 22360
With minimal changes to your current code, you could iterate over the string one character at a time and turn the list of separators you already have into a set for O(1) lookup time. That way you don't have to keep several counter variables in sync:
    def extract_words(sentence):
        separator_set = set([",",".",";","'","?","/","<",">","@","!","#","$","%","^","&","*","(",")","-","_","1","2","3","4","5","6","7","8","9"])
        words_list = []
        word = []
        for c in sentence:
            if c not in separator_set:
                word.append(c)
            else:
                if len(word) > 0:
                    words_list.append(''.join(word))
                word = []
        if len(word) > 0:
            words_list.append(''.join(word))
        return words_list

    def is_letter(char):
        return ("A" <= char and char <= "Z") or ("a" <= char and char <= "z")

    def main():
        print(extract_words("The@Tt11end"))

    if __name__ == '__main__':
        main()
Output:
['The', 'Tt', 'end']
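One thing to watch with a fixed separator collection: any character that is not listed counts as part of a word. The list above has no "0" and no whitespace, so those characters would stay inside words. The sketch below compares the two membership tests on a small input. `collect_words` and `SEPARATORS` are hypothetical names introduced only for this comparison; they are not part of the code above.

```python
# Hypothetical comparison helper; SEPARATORS mirrors the separator list above.
SEPARATORS = set(",.;'?/<>@!#$%^&*()-_123456789")


def is_letter(char):
    return ("A" <= char <= "Z") or ("a" <= char <= "z")


def collect_words(sentence, is_word_char):
    """Group consecutive characters for which is_word_char(c) is true."""
    words_list = []
    word = []
    for c in sentence:
        if is_word_char(c):
            word.append(c)
        elif word:
            # A non-word character ends the current word, if there is one.
            words_list.append("".join(word))
            word = []
    if word:
        words_list.append("".join(word))
    return words_list


by_separators = collect_words("a0b c", lambda c: c not in SEPARATORS)
by_letters = collect_words("a0b c", is_letter)
print(by_separators)  # ['a0b c']  ("0" and " " are not in SEPARATORS)
print(by_letters)     # ['a', 'b', 'c']
```

If the input can contain characters outside the list, testing with is_letter is probably the safer choice.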
Upvotes: 1
Reputation: 554
This code does it:
    def extract_words(sentence):
        sentence = list(sentence)
        words_list = []
        separator = [",",".",";","'","?","/","<",">","@","!","#","$","%","^","&","*","(",")","-","_","1","2","3","4","5","6","7","8","9"]
        bufferS = []
        for i in range(len(sentence)):
            if sentence[i] not in separator:
                bufferS.append(sentence[i])
            else:
                words_list.append(''.join(bufferS))
                bufferS = []
        words_list.append(''.join(bufferS))
        words_list = [x for x in words_list if x != '']
        return words_list
The way it works is quite simple. I break the string into a list using list(sentence).
Before the loop, I define a list, bufferS, that will hold the letters of the current word.
Then I iterate over the list, and if the character at sentence[i] is not in the separator list, I add it to bufferS.
Once I find a character that is in the separator list, I add ''.join(bufferS) (which builds a string from the list) to the words list and reset bufferS. After the loop, the last buffer is added as well, and the empty strings produced by adjacent separators are filtered out.
Test it with:
print(extract_words('aaaaaaa,bbbbbbb*ccccc,dddd'))
It returns
['aaaaaaa', 'bbbbbbb', 'ccccc', 'dddd']
No library used.
Upvotes: 0
Reputation: 7812
A regular expression would be the best fit here, but if you want something more exotic, here it is:
str = "The@Tt11end444sooqa"
delims = [0] + [i + 1 for i, s in enumerate(str) if not s.isalpha()] + [len(str) + 1]
parts = [str[delims[i]: delims[i + 1] - 1] for i in range(len(delims) - 1) if delims[i + 1] - delims[i] != 1]
Here is an expanded version that makes it easier to see what's going on:
    str = "The@Tt11end444sooqa"
    # delims will contain the cut points: the index just past each non-alphabetic character
    delims = [0]  # 0 is the first cut point (start of string)
    for i, s in enumerate(str):  # iterate through "str"
        if not s.isalpha():  # character is non-alphabetic, so record a cut point
            delims.append(i + 1)  # add 1 so the delimiter itself is excluded from the next part
    delims += [len(str) + 1]  # cut point past the end of the string, so the last part is not missed
    # parts will contain the pieces of the original string stored in "str"
    parts = []
    for i in range(len(delims) - 1):  # iterate over "delims" by index
        # skip the part if two delimiters are adjacent (the part would be empty)
        if delims[i + 1] - delims[i] != 1:
            substr = str[delims[i]: delims[i + 1] - 1]  # copy the substring between the cut points
            parts.append(substr)
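For reuse, the same cut-point idea can be wrapped in a function. This is only a sketch: the name `extract_words_by_index` and the parameter `text` are my own, chosen so that the built-in `str` is not shadowed. Note that `isalpha()` also accepts non-ASCII letters, unlike the is_letter() from the question.

```python
def extract_words_by_index(text):
    """Split text at non-alphabetic characters using cut-point indexes."""
    # Each cut point is the index just past a non-alphabetic character.
    delims = [0]
    for i, ch in enumerate(text):
        if not ch.isalpha():
            delims.append(i + 1)
    # One extra cut point past the end so the last part is handled the same way.
    delims.append(len(text) + 1)

    parts = []
    for start, end in zip(delims, delims[1:]):
        # Adjacent cut points mean an empty part, so skip them.
        if end - start != 1:
            parts.append(text[start:end - 1])
    return parts


print(extract_words_by_index("The@Tt11end444sooqa"))
# ['The', 'Tt', 'end', 'sooqa']
```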
Upvotes: 0
Reputation: 461
Your problem is that you set i to counter at the start of every pass, and counter never moves past the first non-letter.
The for loop still runs len(sentence) times, but on each pass i is reset to the index where is_letter first failed, in this case i = 3.
E.g.
T = 0
h = 1
e = 2
@ = 3 > not a letter
On the next pass the for loop gives i its next value, but i=counter immediately overwrites it with 3, because counter is only incremented inside the while(is_letter) block and "@" is not a letter. A more appropriate approach here is an if/else, as follows:
    def extract_words(sentence):
        words_list = []
        words = ""
        for i in range(len(sentence)):
            if is_letter(sentence[i]):
                words += sentence[i]
            else:
                if words != "":
                    words_list.append(words)
                    words = ""
        if words != "":
            words_list.append(words)
        return words_list

    def is_letter(char):
        return ("A" <= char and char <= "Z") or \
               ("a" <= char and char <= "z")

    if __name__ == '__main__':
        print(extract_words("The@Tt11end"))
Output:
['The', 'Tt', 'end']
In this setup, i is the only index variable. The for loop already advances it, and reassigning i inside the loop body can cause issues, as you have seen.
Each time the current character is a letter, it is added to the words variable. When the next character is a symbol or digit, the word is appended to the list and the symbol/digit is ignored.
Finally, if two or more symbols are next to each other (which caused you to get a list of empty strings ''), the code checks whether words contains any characters, and if not, it simply continues to the next character.
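To see the stuck counter from the original code in action, here is a small instrumented replay of that loop. It is only a sketch: `trace_original_loop` is a hypothetical helper, and it assumes words starts out as an empty string (the posted snippet does not show that) and adds a bounds check so the while loop cannot run off the end of the string.

```python
def is_letter(char):
    return ("A" <= char <= "Z") or ("a" <= char <= "z")


def trace_original_loop(sentence):
    """Replay the question's loop, recording (start, counter, word) per pass."""
    trace = []
    counter = 0
    words = ""  # assumption: initialised somewhere in the original code
    for _ in range(len(sentence)):
        i = counter  # the reset that causes the problem
        start = i
        # Bounds check added here; the original has none.
        while i < len(sentence) and is_letter(sentence[i]):
            words += sentence[i]
            i += 1
            counter += 1
        trace.append((start, counter, words))
        words = ""
    return trace


for start, counter, word in trace_original_loop("The@Tt11end"):
    print(start, counter, repr(word))
# First pass prints: 0 3 'The'
# Every later pass prints: 3 3 ''
```

Every pass after the first starts at index 3 and stays there, which matches the list of empty strings in the question.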
Upvotes: 0
Reputation: 39354
Your code is getting into a tangle and is not indexing into the given sentence correctly.
You only need to iterate through the characters in the sentence:
    def is_letter(char):
        return ("A" <= char <= "Z") or ("a" <= char <= "z")

    def extract_words(sentence):
        word = ""
        words_list = []
        for ch in sentence:
            if is_letter(ch):
                word += ch
            else:
                if word:
                    words_list.append(word)
                    word = ""
        if word:
            words_list.append(word)
        return words_list

    print(extract_words('The@,Tt11end'))
Output:
['The', 'Tt', 'end']
The code iterates through each char in sentence. If it is a letter, it is added to the current word. If not, the current word, if there is one, is added to the output list. Finally, if the last char is a letter, there will be a word left over, which is also added to the output.
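A few edge cases are worth checking against that description: an empty string, input made up only of separators, input with no separators at all, and whitespace. The functions are repeated here only so the snippet runs on its own. The test inputs are my own and are not taken from the question.

```python
def is_letter(char):
    return ("A" <= char <= "Z") or ("a" <= char <= "z")


def extract_words(sentence):
    word = ""
    words_list = []
    for ch in sentence:
        if is_letter(ch):
            word += ch
        else:
            if word:
                words_list.append(word)
                word = ""
    # Flush the last word when the sentence ends with a letter.
    if word:
        words_list.append(word)
    return words_list


print(extract_words(""))        # []
print(extract_words("@@11"))    # []
print(extract_words("end"))     # ['end']
print(extract_words("a b\tc"))  # ['a', 'b', 'c']
```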
Upvotes: 1