Zarok

Reputation: 61

Python regex findall matching all pairs of words

I need to build a list of every pair of consecutive words in a string using a regex. The relevant part of the code is this:

for word in re.findall(r'\w+\b.*?\w+', text):

Now let's take the text "This is a random text" as an example. What I want is a list like this:

['This is','is a','a random','random text']

Instead what I'm getting is this:

['This is','a random']

How can I fix this? Thanks in advance.

Upvotes: 1

Views: 1345

Answers (6)

BrokenBenchmark

Reputation: 19252

A regular expression isn't necessary here.

You can use itertools.pairwise if you're on Python 3.10 or later to get all the bigrams. It returns an iterable of 2-tuples, with one word per element of each tuple. You can then use " ".join() to turn each tuple into a string:

from itertools import pairwise

data = "This is a random text"
print([" ".join(bigram) for bigram in pairwise(data.split(" "))])

This outputs:

['This is', 'is a', 'a random', 'random text']
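
If the words can be separated by several spaces or by punctuation, splitting on a single space leaves empty strings or punctuation attached to the words. A minimal sketch, assuming a "word" is a run of \w characters (that definition and the helper name bigrams are assumptions here), which also runs on Python versions before 3.10 because it uses zip instead of pairwise:

```python
import re

def bigrams(text):
    # Treat every run of word characters as one word; whitespace and
    # punctuation between words are discarded.
    words = re.findall(r"\w+", text)
    # zip(words, words[1:]) pairs each word with its successor.
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

print(bigrams("This.  is a random, text"))
```

The regex is only used to tokenize; the pairing itself is still done without one.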

Upvotes: 0

Charif DZ

Reputation: 14741

You said that the words are separated by a random amount of spaces and/or punctuation, so I used [\s\.]+ for that.

What you are doing wrong is consuming the second word. What you need is a positive lookahead that matches the second word without consuming it, so that it can be matched again as the first word of the next pair. And because you said it's a massive text, I think finditer is better than findall: it returns an iterator that yields match objects one at a time instead of building the whole list in memory:

import re

text ="""This. is a random text"""

pattern = re.compile(r'(\w+[\s\.]+)(?=(\w+))')
for match in pattern.finditer(text):
    # rebuild the word
    element = ''.join(match.groups())
    print(element)

Output:

This. is
is a
a random
random text

Note that a positive lookahead is not a capturing group by default, which is why I wrote (?=(\w+)) to capture the word inside it. The first group is (\w+[\s\.]+), and I used join to concatenate the two groups again.
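
If the separator should not end up in the output, one option is to move it outside the first group and join the two captured words with a single space. A minimal sketch of that variant (the function name word_pairs is made up for illustration):

```python
import re

# The separator [\s.]+ sits between the groups, so it is matched but
# not captured. The second word is captured inside the lookahead.
pattern = re.compile(r'(\w+)[\s.]+(?=(\w+))')

def word_pairs(text):
    # Each match consumes only the first word and its separator, so the
    # next match can start on the word that was just looked at.
    return [f"{m.group(1)} {m.group(2)}" for m in pattern.finditer(text)]

print(word_pairs("This. is a random text"))
```

For a very large text, the list comprehension could be replaced by a generator expression so that pairs are produced lazily.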

Upvotes: 2

abhilb

Reputation: 5757

But do you really need a regex? You can do that without one:

line = "This is a random text"
L1 = line.split(' ')
L2 = L1[1:]  # the same words shifted by one position
Result = [' '.join((a, b)) for a, b in zip(L1, L2)]

Using regex, but the results are not in order:

>>> pattern1 = re.compile(r"(\w+\s+\w+)")
>>> pattern2 = re.compile(r"(\s+\w+\s+\w+)")
>>> l1 = re.findall(pattern1, line)
>>> l2 =[x.strip() for x in re.findall(pattern2, line)]
>>> l1
['This is', 'a random']
>>> l2
['is a', 'random text']
>>> l1 + l2
['This is', 'a random', 'is a', 'random text']
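
If the order matters, one way to restore it is to record where each pair starts and sort on that offset. This is only a sketch of that idea; it uses finditer instead of findall, and the names found and ordered are made up for illustration:

```python
import re

line = "This is a random text"
pattern1 = re.compile(r"\w+\s+\w+")
# The leading whitespace stays outside the group, so no strip() is needed.
pattern2 = re.compile(r"\s+(\w+\s+\w+)")

# Collect (start offset, pair) tuples from both patterns.
found = [(m.start(), m.group()) for m in pattern1.finditer(line)]
found += [(m.start(1), m.group(1)) for m in pattern2.finditer(line)]

# Sorting by offset puts the pairs back in the order they appear in the text.
ordered = [pair for _, pair in sorted(found)]
print(ordered)
```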

Upvotes: 0

Dev Khadka

Reputation: 5461

You don't need a regex for this case; you can just use split:

st = "This is a random text"
sp = st.split()

result = [f"{w1} {w2}" for w1, w2 in zip(sp, sp[1:])]
print(result)

Result:

['This is', 'is a', 'a random', 'random text']

Edit

For large data you can implement a generator, like the pseudocode below:

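def get_pair_from_large_text():
    last_word = None  # last word of the previous chunk, carried over
    while True:
        # placeholder: returns the next piece of text, or "" when exhausted
        chunk = get_string_chunk_from_source()
        if len(chunk) == 0:
            break

        # assumes chunks never cut a word in half
        words = chunk.split()
        if last_word is not None:
            # pair the previous chunk's last word with this chunk's first word
            words.insert(0, last_word)

        for w1, w2 in zip(words, words[1:]):
            yield f"{w1} {w2}"

        if words:
            last_word = words[-1]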
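
A self-contained variant of the same idea, as a sketch only. It assumes the chunks come from any iterable of strings (a list here, but it could be reads from a file), and the names iter_words and iter_pairs are made up for illustration. Unlike the simplest version, it holds back the last word of a chunk when the chunk does not end in whitespace, so a word cut in half by a chunk boundary is put back together:

```python
def iter_words(chunks):
    """Yield whole words from an iterable of string chunks."""
    tail = ""  # possibly incomplete word carried over from the previous chunk
    for chunk in chunks:
        buffer = tail + chunk
        words = buffer.split()
        # If the buffer does not end in whitespace, its last word may
        # continue in the next chunk, so hold it back.
        if words and not buffer[-1].isspace():
            tail = words.pop()
        else:
            tail = ""
        yield from words
    if tail:
        yield tail


def iter_pairs(chunks):
    """Yield "w1 w2" for every pair of consecutive words."""
    previous = None
    for word in iter_words(chunks):
        if previous is not None:
            yield f"{previous} {word}"
        previous = word


print(list(iter_pairs(["This is a ran", "dom text"])))
```

Only one word and one partial word are kept in memory between chunks, so the memory use depends on the chunk size rather than on the size of the whole text.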


Upvotes: 0

vs97

Reputation: 5859

If you want to use regex for this task, take a look at this:

(\w+)\s+(?=(\w+))

Regex Demo

The trick is to use positive lookahead for the second word and capture it within a group. In order to output the resulting pairs, combine the result of Group 1 and Group 2 matches.
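
In Python, combining the two groups could look like the sketch below. It relies on re.findall returning one tuple per match when the pattern contains more than one capturing group; the variable names are only illustrative:

```python
import re

text = "This is a random text"

# With two capturing groups, findall returns (group 1, group 2) tuples.
pairs = re.findall(r"(\w+)\s+(?=(\w+))", text)

# Join each tuple into a single "word1 word2" string.
result = [" ".join(pair) for pair in pairs]
print(result)
```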

Upvotes: 1

Prometheus

Reputation: 618

Typically, I don't think a single regex search returns overlapping matches, because each match consumes the text it covers. What you might want to do instead is find the spaces between words and take the word just before and the word just after each one.
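
A rough sketch of that approach, assuming a "word" is a run of \w characters and that any run of whitespace counts as one gap (the function name pairs_around_spaces is made up for illustration):

```python
import re

def pairs_around_spaces(text):
    pairs = []
    for gap in re.finditer(r"\s+", text):
        # Word that ends exactly where the gap starts.
        before = re.search(r"\w+$", text[:gap.start()])
        # Word that starts exactly where the gap ends.
        after = re.match(r"\w+", text[gap.end():])
        if before and after:
            pairs.append(f"{before.group()} {after.group()}")
    return pairs

print(pairs_around_spaces("This is a random text"))
```

The slices copy part of the string for every gap, so this is fine for short inputs but slow for a massive text.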

Upvotes: 0
