Reputation: 85036
I am trying to create a regex that will take a longish string that contains space separated words and break it into chunks of up to 50 characters that end with a space or the end of the line.
I first came up with (.{0,50}(\s|$)), but that only grabbed the first match. I then thought I would add a * to the end, giving (.{0,50}(\s|$))*, but now it grabs the entire string.
I've been testing here, but can't seem to get it to work as needed. Can anyone see what I am doing wrong?
Upvotes: 0
Views: 78
Reputation: 4418
It's not using a regex, but have you thought about using textwrap.wrap()?
In [8]: import textwrap
text = ' '.join([
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed et convallis",
"lectus. Quisque maximus diam ut sodales tincidunt. Integer ac finibus",
"elit. Etiam tristique euismod justo, vel pretium tellus malesuada et.",
"Pellentesque id mattis eros, at bibendum mauris. In luctus lorem eget nisl",
"sagittis sollicitudin. Aenean consequat at lacus at porttitor. Nunc sit",
"amet neque eu sem venenatis rutrum. Proin sed tempus lacus, sit amet porta",
"velit. Suspendisse et semper nisl, eu varius orci. Ut non metus."])
In [9]: textwrap.wrap(text, 50)
Out[9]: ['Lorem ipsum dolor sit amet, consectetur adipiscing',
'elit. Sed et convallis lectus. Quisque maximus',
'diam ut sodales tincidunt. Integer ac finibus',
'elit. Etiam tristique euismod justo, vel pretium',
'tellus malesuada et. Pellentesque id mattis eros,',
'at bibendum mauris. In luctus lorem eget nisl',
'sagittis sollicitudin. Aenean consequat at lacus',
'at porttitor. Nunc sit amet neque eu sem venenatis',
'rutrum. Proin sed tempus lacus, sit amet porta',
'velit. Suspendisse et semper nisl, eu varius orci.',
'Ut non metus.']
Upvotes: 0
Reputation: 1397
Here, it seems to be working:
import re
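# r'' instead of ur'' (Python 3); (?:\s|$) is an alternation, whereas inside
# [...] the | and $ would be literal characters. {1,50} skips empty matches.
p = re.compile(r'(.{1,50}(?:\s|$))')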
test_str = u"jasdljasjdlk jal skdjl ajdl kajsldja lksjdlkasd jas lkjdalsjdalksjdalksjdlaksjdk sakdjakl jd fgdfgdfg\nhgkjd fdkfhgk dhgkjhdfhg kdhfgk jdfghdfkjghjf dfjhgkdhf hkdfhgkj jkdfgk jfgkfg dfkghk hdfkgh d asdada \ndkjfghdkhg khdfkghkd hgkdfhgkdhfk k dfghkdfgh dfgdfgdfgd\n"
re.findall(p, test_str)
Upvotes: 1
Reputation: 986
Here's what you need: '[^\s]{1,50}'. An example with a smaller limit:
>>> import re
>>> text = "Lorem ipsum sit dolor"
>>> splitter = re.compile(r'[^\s]{1,3}')
>>> splitter.findall(text)
['Lor', 'em', 'ips', 'um', 'sit', 'dol', 'or']
Upvotes: -1
Reputation: 874
What are you using to match the regex? The re.findall() method should return what you want.
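A minimal sketch of that, assuming Python 3 and that no single word is longer than 50 characters. The names chunker, text and chunks are just for illustration. Two small changes to the pattern from the question: the inner group is made non-capturing, because re.findall() returns tuples when a pattern has more than one group, and {0,50} becomes {1,50}, which avoids an empty match at the end of the string.

```python
import re

# Up to 50 characters, followed by one whitespace character or the end of
# the string. (?:...) keeps findall() returning plain strings, not tuples.
chunker = re.compile(r'.{1,50}(?:\s|$)')

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Sed et convallis lectus.")

chunks = chunker.findall(text)
for chunk in chunks:
    print(repr(chunk))
```

Note that each chunk keeps its trailing space, so a chunk can be 51 characters long in total; call chunk.rstrip() if you don't want it.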
Upvotes: 1