Abe Miessler

Reputation: 85036

Breaking text into chunks using regex?

I am trying to create a regex that will take a longish string that contains space separated words and break it into chunks of up to 50 characters that end with a space or the end of the line.

I first came up with: (.{0,50}(\s|$)) but that only grabbed the first match. I then thought I would add a * to the end: (.{0,50}(\s|$))* but now it grabs the entire string.

I've been testing here, but can't seem to get it to work as needed. Can anyone see what I am doing wrong?
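For reference, a minimal sketch that reproduces both behaviours described above. It assumes Python's re module (the question does not name a language, but the answers use Python) and a made-up sample string:

```python
import re

# Hypothetical sample text; any space-separated string behaves similarly.
text = "the quick brown fox jumps over the lazy dog " * 3

# A single search returns only the first chunk of at most 50 characters
# followed by whitespace (or the end of the string).
first = re.search(r'(.{0,50}(\s|$))', text)
print(repr(first.group(0)))

# With * after the group, one match repeats the group back to back,
# so the overall match spans the entire string.
starred = re.match(r'(.{0,50}(\s|$))*', text)
print(starred.group(0) == text)
```

In both cases the pattern yields a single match object, so getting every chunk requires iterating over matches rather than repeating the group inside one match.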

Upvotes: 0

Views: 78

Answers (4)

RootTwo

Reputation: 4418

It's not using a regex, but have you thought about using textwrap.wrap()?

In [8]: import textwrap

        text = ' '.join([
           "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed et convallis",
           "lectus. Quisque maximus diam ut sodales tincidunt. Integer ac finibus",
           "elit. Etiam tristique euismod justo, vel pretium tellus malesuada et.",
           "Pellentesque id mattis eros, at bibendum mauris. In luctus lorem eget nisl",
           "sagittis sollicitudin. Aenean consequat at lacus at porttitor. Nunc sit",
           "amet neque eu sem venenatis rutrum. Proin sed tempus lacus, sit amet porta",
           "velit. Suspendisse et semper nisl, eu varius orci. Ut non metus."])

In [9]: textwrap.wrap(text, 50)
Out[9]: ['Lorem ipsum dolor sit amet, consectetur adipiscing',
        'elit. Sed et convallis lectus. Quisque maximus',
        'diam ut sodales tincidunt. Integer ac finibus',
        'elit. Etiam tristique euismod justo, vel pretium',
        'tellus malesuada et. Pellentesque id mattis eros,',
        'at bibendum mauris. In luctus lorem eget nisl',
        'sagittis sollicitudin. Aenean consequat at lacus',
        'at porttitor. Nunc sit amet neque eu sem venenatis',
        'rutrum. Proin sed tempus lacus, sit amet porta',
        'velit. Suspendisse et semper nisl, eu varius orci.',
        'Ut non metus.']

Upvotes: 0

Walter_Ritzel

Reputation: 1397

Here, it seems to be working:

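import re

# Python 3 syntax: plain r'' (the ur'' prefix is Python 2 only).
# [\s|$] is a character class that treats | and $ literally, so the
# alternation (?:\s|$) is used to also accept the end of the string.
# {1,50} instead of {0,50} avoids an empty match at the very end.
p = re.compile(r'.{1,50}(?:\s|$)')
test_str = "jasdljasjdlk jal skdjl ajdl kajsldja lksjdlkasd jas lkjdalsjdalksjdalksjdlaksjdk sakdjakl jd fgdfgdfg\nhgkjd fdkfhgk dhgkjhdfhg kdhfgk jdfghdfkjghjf dfjhgkdhf hkdfhgkj jkdfgk jfgkfg dfkghk hdfkgh d asdada \ndkjfghdkhg khdfkghkd hgkdfhgkdhfk k dfghkdfgh dfgdfgdfgd\n"

print(p.findall(test_str))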
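The difference between a character class and an alternation matters when the text does not end in whitespace. A small sketch, using a made-up input string, of how [\s|$] and (?:\s|$) behave on the final word:

```python
import re

# Hypothetical input with no trailing whitespace.
text = "one two three"

# Inside [...], "|" and "$" are literal characters, so every chunk must be
# followed by whitespace (or a literal | or $) and the final word is lost.
with_class = re.findall(r'.{1,50}[\s|$]', text)

# The alternation (?:\s|$) also accepts the end of the string.
with_alternation = re.findall(r'.{1,50}(?:\s|$)', text)

print(with_class)        # ['one two ']
print(with_alternation)  # ['one two three']
```

The sample string in the answer ends with a newline, which is why the character-class version appears to work there.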

Upvotes: 1

Sergius

Reputation: 986

You could use '[^\s]{1,50}', which matches runs of up to 50 non-whitespace characters (so each word is chunked on its own). Example with a smaller limit:

>>> import re
>>> text = "Lorem ipsum sit dolor"
>>> splitter = re.compile(r'[^\s]{1,3}')
>>> splitter.findall(text)
['Lor', 'em', 'ips', 'um', 'sit', 'dol', 'or']

Upvotes: -1

Chuck

Reputation: 874

What are you using to match the regex? The re.findall() function should return what you want.
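One thing to watch for: when a pattern contains more than one capturing group, re.findall() returns a tuple of groups per match rather than the matched text. A rough sketch with a made-up sample string, using {1,50} instead of the question's {0,50} so that no empty match is produced at the end:

```python
import re

# Hypothetical sample text, not taken from the question.
text = "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu"

# Two capturing groups: findall returns one (outer, inner) tuple per match.
pairs = re.findall(r'(.{1,50}(\s|$))', text)
chunks = [outer for outer, _ in pairs]

# With only a non-capturing group, findall returns the matched strings directly.
same_chunks = re.findall(r'.{1,50}(?:\s|$)', text)

print(chunks)
# ['alpha beta gamma delta epsilon zeta eta theta iota ', 'kappa lambda mu']
```

Each chunk keeps its trailing space, so joining the chunks gives back the original text.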

Upvotes: 1
