Reputation: 61
I need to make a list of every pair of consecutive words in a string using a regex. The relevant part of the code is this:
for word in re.findall(r'\w+\b.*?\w+', text):
Now let's take as an example the text "This is a random text". What I want is a list like this:
['This is','is a','a random','random text']
Instead what I'm getting is this:
['This is','a random']
How can I fix this? Thanks in advance.
Upvotes: 1
Views: 1345
Reputation: 19252
A regular expression isn't necessary here.
You can use itertools.pairwise if you're on Python 3.10 or later to get all the bigrams. It returns an iterator of 2-tuples, with one word per element in the tuple. You can then use " ".join() to turn each tuple into a string:
from itertools import pairwise
data = "This is a random text"
print([" ".join(bigram) for bigram in pairwise(data.split(" "))])
This outputs:
['This is', 'is a', 'a random', 'random text']
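If you're on a Python older than 3.10, something along these lines should behave the same way. This is only a sketch: pairwise_compat is a name I made up for illustration, and it is built on itertools.tee rather than being part of the standard library.

```python
from itertools import tee

def pairwise_compat(iterable):
    # Hypothetical helper name; yields the same 2-tuples as itertools.pairwise.
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one item
    return zip(first, second)

data = "This is a random text"
bigrams = [" ".join(bigram) for bigram in pairwise_compat(data.split())]
print(bigrams)
```

Here I used split() with no argument so that runs of whitespace don't produce empty strings.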
Upvotes: 0
Reputation: 14741
You said that the words are separated by a random amount of spaces and/or punctuation, so I used [\s\.]+ for that.
What you are doing wrong here is that you are consuming the second word. What you need is a positive lookahead that matches the second word but doesn't consume it, so it can be matched again as the first word of the next pair.
And because you said it's a massive text, I think using finditer is better than findall. The difference is that it returns an iterator that produces match objects one at a time instead of building the whole list up front:
import re
text = """This. is a random text"""
pattern = re.compile(r'(\w+[\s\.]+)(?=(\w+))')
for match in pattern.finditer(text):
    # rebuild the pair from the two captured groups
    element = ''.join(match.groups())
    print(element)
Output:
This. is
is a
a random
random text
Note that a positive lookahead is not a capturing group by default, which is why I wrote (?=(\w+)) to capture the word inside it. The first group is (\w+[\s\.]+), and I used join to concatenate the two groups again.
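To make the "consuming" point concrete, here is a small side-by-side sketch. The variable names are mine, and the first pattern is a simplified stand-in for the one in the question, not the exact original.

```python
import re

text = "This is a random text"

# Consuming pattern: each match swallows both words, so the next search
# starts after the second word and every other pair is skipped.
consuming = re.findall(r'\w+[\s\.]+\w+', text)

# Lookahead pattern: only the first word and the separator are consumed;
# the second word is captured inside the lookahead and stays available.
lookahead = [first + second
             for first, second in re.findall(r'(\w+[\s\.]+)(?=(\w+))', text)]

print(consuming)
print(lookahead)
```

The first list is missing every second pair, while the second list contains all of them.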
Upvotes: 2
Reputation: 5757
But do you really need a regex? You can do that without one:
L1 = line.split(' ')
L2 = L1[1:]
Result = [' '.join((a, b)) for a, b in zip(L1, L2)]
Using a regex, but the result is not in order:
>>> pattern1 = re.compile(r"(\w+\s+\w+)")
>>> pattern2 = re.compile(r"(\s+\w+\s+\w+)")
>>> l1 = re.findall(pattern1, line)
>>> l2 =[x.strip() for x in re.findall(pattern2, line)]
>>> l1
['This is', 'a random']
>>> l2
['is a', 'random text']
>>> l1 + l2
['This is', 'a random', 'is a', 'random text']
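If the order matters, one possible way to restore it is to remember where each match starts and sort on that. This is just a sketch, and it assumes the text starts with a word rather than with whitespace (otherwise the two patterns can find the same pair twice).

```python
import re

line = "This is a random text"
pattern1 = re.compile(r"(\w+\s+\w+)")
pattern2 = re.compile(r"(\s+\w+\s+\w+)")

# Collect (start position, pair) from both patterns, then sort by position.
matches = [(m.start(), m.group().strip())
           for pattern in (pattern1, pattern2)
           for m in pattern.finditer(line)]
ordered = [pair for _, pair in sorted(matches)]
print(ordered)
```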
Upvotes: 0
Reputation: 5461
You don't need to use a regex for this case; you can just use split:
st = "This is a random text"
sp = st.split()
result = [f"{w1} {w2}" for w1, w2 in zip(sp, sp[1:])]
print(result)
Result:
['This is', 'is a', 'a random', 'random text']
Edit
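For large data you can implement a generator. A sketch is below, where chunks is any iterable of strings (for example, successive reads from a file):
def get_pair_from_large_text(chunks):
    prev_word = None  # last complete word seen so far
    partial = ""      # possibly unfinished word at the end of the previous chunk
    for chunk in chunks:
        chunk = partial + chunk
        words = chunk.split()
        # a chunk that does not end in whitespace may have cut its last word in half
        partial = words.pop() if words and not chunk[-1].isspace() else ""
        for word in words:
            if prev_word is not None:
                yield f"{prev_word} {word}"
            prev_word = word
    # the final word can no longer be continued, so flush it
    if partial and prev_word is not None:
        yield f"{prev_word} {partial}"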
Upvotes: 0
Reputation: 5859
If you want to use regex for this task, take a look at this:
(\w+)\s+(?=(\w+))
The trick is to use positive lookahead for the second word and capture it within a group. In order to output the resulting pairs, combine the result of Group 1 and Group 2 matches.
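In Python, combining the two groups might look like the sketch below. The sample text and variable names are my own; I put irregular whitespace in the sample to show that joining the groups with a single space normalizes it.

```python
import re

text = "This   is a\trandom text"
pattern = re.compile(r'(\w+)\s+(?=(\w+))')

# Group 1 is the consumed first word; group 2 is captured inside the
# lookahead, so it is still there to start the next match.
pairs = [f"{m.group(1)} {m.group(2)}" for m in pattern.finditer(text)]
print(pairs)
```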
Upvotes: 1
Reputation: 618
As far as I know, a single regex search doesn't return overlapping matches. What you might want to do instead is find the spaces between the words, and for each one take the word just before it and the word just after it.
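A rough sketch of that idea is below. The function name pairs_around_spaces is made up, and slicing the string for every gap is not efficient on large inputs, so treat it as an illustration rather than something tuned.

```python
import re

def pairs_around_spaces(text):
    pairs = []
    for gap in re.finditer(r'\s+', text):
        # word that ends right where the gap begins
        before = re.search(r'\w+$', text[:gap.start()])
        # word that starts right where the gap ends
        after = re.match(r'\w+', text[gap.end():])
        if before and after:
            pairs.append(f"{before.group()} {after.group()}")
    return pairs

print(pairs_around_spaces("This is a random text"))
```

Gaps at the very start or end of the text have no word on one side, so they are skipped.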
Upvotes: 0