Reputation: 5
I am receiving a stream of tweets with Python and would like to extract the last word, or know where to reference it.
For example, in
NC don’t like working together www.linktowtweet.org
I would like to get back
together
Upvotes: 0
Views: 128
Reputation: 1734
I am not familiar with tweepy, so I am presuming you have the data in a Python string; there may be a better answer.
However, given a string in Python, it is simple to extract the last word.
Solution 1
Use str.rfind(' '). The idea here is to find the space preceding the last word. Here is an example.
text = "NC don’t like working together"
text = text.rstrip()  # Remove any whitespace at the end that would otherwise confuse the algorithm.
last_word = text[text.rfind(' ')+1:]  # Take every character *after* the last space.
print(last_word)
Note: If the string contains no words, last_word will be an empty string.
Now this presumes that all of the words are separated by spaces. To handle newlines and tabs, use str.replace to turn them into spaces. The whitespace characters in Python are ' \t\n\r\x0b\x0c', but I presume only newlines and tabs will be found in Twitter messages.
Also see: string.whitespace
So a complete example (wrapped as a function) would be
def last_word(text):
    text = text.replace('\n', ' ')  # Replace newlines with spaces.
    text = text.replace('\t', ' ')  # Replace tabs with spaces.
    text = text.rstrip(' ')  # Remove trailing spaces.
    return text[text.rfind(' ')+1:]
print(last_word("NC don’t like working together")) # Outputs "together".
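If you would rather not list the whitespace characters yourself, here is a minimal sketch of an alternative (not part of the solution above) that relies on str.rsplit with no separator, which splits on any run of whitespace and ignores trailing whitespace. The function name last_word_any_whitespace is just a name I made up for illustration.

```python
def last_word_any_whitespace(text):
    # With sep=None, rsplit treats any run of whitespace (spaces, tabs,
    # newlines, ...) as a single separator and ignores trailing whitespace.
    # maxsplit=1 means only the last word is split off.
    parts = text.rsplit(None, 1)
    # An empty or whitespace-only string yields an empty list.
    return parts[-1] if parts else ''

print(last_word_any_whitespace("NC don’t like working\ttogether\n"))
```

Like the version above, this returns an empty string when the text contains no words.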
This may still be the best solution for basic parsing, but there is something better for larger problems.
Solution 2
Regular Expressions
These are a much more flexible way to handle strings in Python. Regular expressions (regex, as they are often called) use their own language to specify a portion of text.
For example, .*\s(\S+) specifies the last word in a string.
Here it is again with a longer explanation.
.* # Match as many characters as possible.
\s # Until a whitespace ("\t\n\x0b\x0c\r ")
( # Remember the next section for the answer.
\S+ # Match as many non-whitespace characters as possible (the word).
) # End saved section.
So then, in python you would use this as follows.
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m:  # No match: the text has no whitespace followed by a word.
        return ''
    return m.group(1)  # Otherwise return the last word.
print(last_word("NC don’t like working together")) # Outputs "together".
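As a side note, the commented breakdown of the pattern can be kept in the code itself by compiling with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch of the same pattern written that way; the name LAST_WORD_VERBOSE is made up for illustration.

```python
import re

# Same pattern as above, spelled out with inline comments.
# re.VERBOSE ignores whitespace and "#" comments inside the pattern;
# re.DOTALL makes "." match newlines too.
LAST_WORD_VERBOSE = re.compile(r"""
    .*      # Match as many characters as possible...
    \s      # ...up to a whitespace character.
    (       # Start of the captured group.
    \S+     # One or more non-whitespace characters (the word).
    )       # End of the captured group.
""", re.DOTALL | re.VERBOSE)

m = LAST_WORD_VERBOSE.match("NC don’t like working together")
print(m.group(1) if m else '')
```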
Now, even though this method is a lot less obvious, it has a couple of advantages. First off, it is a lot more customizable. If you wanted to match the final word but not links, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" would match the last word but skip a link if that was the last thing in the text.
Example:
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m:  # No match: there is no last word that is not part of a link.
        return ''
    return m.group(1)  # Otherwise return the last word.
print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".
The second advantage to this method is speed.
As you can see in the Try it online! demo, the regex approach is almost as fast as the string manipulation, if not faster in some cases. (I actually found that the regex executes 0.2 usec faster on my machine than in the demo.)
Either way, the regex execution is extremely fast, even in the simple case, and the regex should be faster than any more complex string algorithm implemented in pure Python. So using the regex can also speed up the code.
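If you want to check the timing on your own machine, a rough sketch with the standard timeit module could look like the following. The function names last_word_str and last_word_re are made up here to keep the two approaches apart, and the numbers you get will depend on your machine and Python version.

```python
import re
import timeit

LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def last_word_str(text):
    # String-method approach from Solution 1.
    text = text.replace('\n', ' ').replace('\t', ' ').rstrip(' ')
    return text[text.rfind(' ') + 1:]

def last_word_re(text):
    # Regex approach from Solution 2.
    m = LAST_WORD_PATTERN.match(text)
    return m.group(1) if m else ''

sample = "NC don’t like working together"
runs = 100_000
for func in (last_word_str, last_word_re):
    seconds = timeit.timeit(lambda: func(sample), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.3f} usec per call")
```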
EDIT: Changed the URL-avoiding regex from
re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
to
re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
so that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.
To see how this regex works, look at https://regex101.com/r/sdwpqB/2.
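To make the difference between the two patterns concrete, here is a small sketch that runs both on the same tweet. The names OLD_PATTERN and NEW_PATTERN and the two-argument last_word helper are only for this comparison.

```python
import re

# The pattern before and after the edit described above.
OLD_PATTERN = re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
NEW_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)

def last_word(pattern, text):
    m = pattern.match(text)
    return m.group(1) if m else ''

tweet = "NC don’t like working together http://www.linktowtweet.org"
print(last_word(OLD_PATTERN, tweet))  # Outputs "http://".
print(last_word(NEW_PATTERN, tweet))  # Outputs "together".
```

The old pattern allowed ":" and "/" inside the word, so it stopped at the scheme; the new one excludes ":" and rejects anything followed by "://".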
Upvotes: 1
Reputation: 711
Simple, so if your text is:
import re
text = "NC don’t like working together www.linktowtweet.org"
text = re.sub(r'(?:https?://|www\.)\S+', '', text)  # remove any URL starting with http://, https://, or www.
words = text.split()  # split the sentence into words on any whitespace
last_word = words[-1]
So there you go! Now you'll get the last word "together".
Upvotes: 0