Reputation:
I have a list of strings that looks something like this:
list_strings = ["The", "11:2dog", "is", "2:33", "a22:11", "german", "shepherd.2:2"]
Here is what I want to do:
For each string in the list, I want to remove the numbers that match the pattern number:number. This pattern will always be at the beginning or the end of the string.
When the pattern is removed from the string, I want to insert it as the next element of the list if it was at the end of the string, or as the previous element if it was at the beginning.
So:
list_strings = ["The", "11:2dog", "is", "2:33", "a22:11", "german", "shepherd.2:2"]
becomes:
new_list_strings = ["The", "11:2", "dog", "is", "2:33", "a", "22:11", "german", "shepherd.", "2:2"]
To find the words that may contain the pattern, I have tried using regular expressions:
for index, word in enumerate(list_strings):
    try:
        if re.search(r'\d+:\d+', word).group() != None:
            words_with_pattern.append([index], word)
    except:
        pass
However, this only finds instances where the pattern is alone like "11:21". Once I have a list of all the words with the pattern, I will have to remove the pattern from the strings, note whether it is at the beginning or at the end, and insert it at the corresponding index in the list.
Any help? Thanks!
Upvotes: 3
Views: 2982
Reputation: 22837
This method uses re.findall to get all matches in a string and then combines the results into one list.
The regex \d+:\d+|(?:(?!\d+:\d+).)+ works as follows:
\d+:\d+ matches one or more digits, followed by :, then one or more digits.
(?:(?!\d+:\d+).)+ is a tempered greedy token that matches any character one or more times, except where \d+:\d+ matches. This forces it to stop matching at that location, and findall then retries the match at that new location (now matching the \d+:\d+ alternative instead), which results in multiple matches per string.
Method 1: the following code is much easier to read than Method 2.
import re
ls = ["The", "11:2dog", "is", "2:33", "a22:11", "german", "shepherd.2:2"]
newls = []
for s in ls:
    newls += re.findall(r"\d+:\d+|(?:(?!\d+:\d+).)+", s)
print(newls)
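To see how the two alternatives of the regex take turns, here is a minimal sketch that runs the same pattern on single strings. The sample strings "a22:11b" and "room101" are hypothetical and are not part of the question's data; they are chosen only to show where the tempered greedy token stops and where it does not.

```python
import re

# Same pattern as in the answer: either a number:number token, or a run of
# characters that stops right before the next number:number token.
PATTERN = re.compile(r"\d+:\d+|(?:(?!\d+:\d+).)+")

# Hypothetical sample strings, used only for illustration.
# Pattern in the middle: the token stops before "22:11", then resumes after it.
middle = PATTERN.findall("a22:11b")

# Digits with no colon are not number:number, so the token keeps consuming them.
digits_without_colon = PATTERN.findall("room101")

print(middle)                # ['a', '22:11', 'b']
print(digits_without_colon)  # ['room101']
```

The second call suggests that plain digits are left inside the word, since the negative lookahead only blocks positions where a full number:number token starts.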
Method 2: this makes the code from Method 1 a one-liner, but it's harder to read. The method used to flatten the list, sum(l, []), is taken from this answer.
import re
ls = ["The", "11:2dog", "is", "2:33", "a22:11", "german", "shepherd.2:2"]
print(sum([re.findall(r"\d+:\d+|(?:(?!\d+:\d+).)+", s) for s in ls], []))
['The', '11:2', 'dog', 'is', '2:33', 'a', '22:11', 'german', 'shepherd.', '2:2']
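As a side note on the flattening step, the sketch below shows what sum(l, []) receives and compares it with itertools.chain.from_iterable from the standard library. The chain variant is not part of the original answer; it is shown here only as one possible alternative, and the variable names are made up for illustration.

```python
import re
from itertools import chain

ls = ["The", "11:2dog", "is", "2:33", "a22:11", "german", "shepherd.2:2"]
pattern = r"\d+:\d+|(?:(?!\d+:\d+).)+"

# One list of matches per input string, i.e. a list of lists.
nested = [re.findall(pattern, s) for s in ls]

# sum() with an empty list as the start value concatenates the inner lists.
flat_sum = sum(nested, [])

# Alternative (an assumption, not from the answer): chain the inner lists lazily.
flat_chain = list(chain.from_iterable(nested))

print(nested[1])               # ['11:2', 'dog']
print(flat_sum == flat_chain)  # True
```

Both variants give the same list here; sum(l, []) builds a new list at every step, so the chain form may be preferable for long inputs.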
Upvotes: 0
Reputation: 24281
You can use re.split:
import re
list_strings = ["The", "11:2dog", "is", "2:33", "a22:11", "german", "shepherd.2:2"]
out = []
for item in list_strings:
    split = re.split(r'(\d+:\d+)', item)
    out.extend([part for part in split if part])
print(out)
# ['The', '11:2', 'dog', 'is', '2:33', 'a', '22:11', 'german', 'shepherd.', '2:2']
split will contain the parts of the string and the separator, since we captured it in the regex.
It also contains empty strings after/before the separator if it was at the end/start of the string, so we have to remove them before extending the output.
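The behaviour described above can be checked on individual strings. This is a small sketch using strings from the question; the variable names are chosen here for illustration only.

```python
import re

# With a capturing group, re.split keeps the separator in the result.
at_start = re.split(r'(\d+:\d+)', "11:2dog")
at_end = re.split(r'(\d+:\d+)', "shepherd.2:2")
alone = re.split(r'(\d+:\d+)', "2:33")
no_match = re.split(r'(\d+:\d+)', "The")

# Without the group, the separator is dropped from the result.
without_group = re.split(r'\d+:\d+', "a22:11")

print(at_start)       # ['', '11:2', 'dog']
print(at_end)         # ['shepherd.', '2:2', '']
print(alone)          # ['', '2:33', '']
print(no_match)       # ['The']
print(without_group)  # ['a', '']
```

The empty strings at the edges are why the loop filters with "if part" before extending the output.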
As @chrisz suggested in the comments, this can be written in a much more compact form using a list comprehension:
[j for i in list_strings for j in re.split(r'(\d+:\d+)', i) if j]
Upvotes: 1