Vikas Kumar Ojha

Reputation: 104

How to extract all the sentences with review/text in the below text?

I want to extract every review/text entry, but my pattern only matches part of each one. This is the output I get:

<re.Match object; span=(226, 258), match='review/text: I like Creme Brulee'>
<re.Match object; span=(750, 860), match='review/text: not what I was expecting in terms of>

import re

text='''
'product/productId: B004K2IHUO\n',
 'review/userId: A2O9G2521O626G\n',
 'review/profileName: Rachel Westendorf\n',
 'review/helpfulness: 0/0\n',
 'review/score: 5.0\n',
 'review/time: 1308700800\n',
 'review/summary: The best\n',
 'review/text: I like Creme Brulee. I loved that these were so easy. Just sprinkle on the sugar that came with and broil. They look amazing and taste great. My guess thought I really went out of the way for them when really it took all of 5 minutes. I will be ordering more!\n',
 '\n',
 'product/productId: B004K2IHUO\n',
 'review/userId: A1ZKFQLHFZAEH9\n',
 'review/profileName: S. J. Monson "world citizen"\n',
 'review/helpfulness: 2/8\n',
 'review/score: 3.0\n',
 'review/time: 1236384000\n',
 'review/summary: disappointing\n',
 "review/text: not what I was expecting in terms of the company's reputation for excellent home delivery products\n",
 '\n',
'''

pattern=re.compile(r'review/text:\s[^.]+')
matches=pattern.finditer(text)

for match in matches:
  print(match)

Upvotes: 1

Views: 324

Answers (2)

Ryszard Czech

Reputation: 18611

Use

matches = re.findall(r'review/text:.+', text)

See proof.

EXPLANATION

--------------------------------------------------------------------------------
  review/text:             'review/text:'
--------------------------------------------------------------------------------
  .+                       any character except \n (1 or more times
                           (matching the most amount possible))
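To see why the original pattern stops early, here is a minimal sketch on a shortened, hypothetical sample shaped like the text in the question. The character class [^.]+ cannot cross a period, so the first match ends at the end of the first sentence, while .+ runs to the end of the line. The capture-group variant is an optional extra, not part of the answer above, for when only the review body is wanted.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's text.
# Inside a non-raw triple-quoted string, each \n is a real line break.
text = '''
 'review/summary: The best\n',
 'review/text: I like Creme Brulee. I loved that these were so easy.\n',
 '\n',
 'review/summary: disappointing\n',
 "review/text: not what I was expecting\n",
'''

# Original pattern: [^.]+ stops at the first period it meets.
narrow = re.findall(r'review/text:\s[^.]+', text)

# Suggested pattern: .+ matches everything up to the line break.
full = re.findall(r'review/text:.+', text)

# Optional variant: a capture group returns only the review body.
bodies = re.findall(r'review/text:\s*(.+)', text)

print(narrow[0])
for line in full:
    print(line)
```

Because . does not match a line break by default, each match from .+ ends cleanly before the closing quote on the following line.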

Upvotes: 0

Tim Jim

Reputation: 670

If you don't mind not using re, and if the identifier is always 'review/text' and your data is always comma-separated, you can get the lines simply with:

matches = [s.strip() for s in text.split(',') if s.strip(' "\n\'').startswith('review/text')]
for match in matches:
  print(match)

where s.strip(' "\'\n') removes spaces, ", ', and newline characters from the beginning and end of each piece before the string comparison. These two lines are returned:

'review/text: I like Creme Brulee. I loved that these were so easy. Just sprinkle on the sugar that came with and broil. They look amazing and taste great. My guess thought I really went out of the way for them when really it took all of 5 minutes. I will be ordering more!
'
"review/text: not what I was expecting in terms of the company's reputation for excellent home delivery products
"
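One caveat: splitting on commas also cuts a review wherever the reviewer typed a comma. The sketch below uses a hypothetical sample containing commas and compares the comma split with a split on line breaks. It assumes every field sits on its own line, as in the question's text.

```python
# Hypothetical sample whose fields contain commas.
text = '''
 'review/summary: Good, but pricey\n',
 'review/text: Tasty, easy, and quick. Would buy again!\n',
'''

# Comma split: the review is cut at its first comma.
by_comma = [s.strip() for s in text.split(',')
            if s.strip(' "\n\'').startswith('review/text')]

# Line split: each field stays whole; strip the surrounding quotes and spaces.
by_line = [ln.strip(' "\'') for ln in text.splitlines()
           if ln.strip(' "\'').startswith('review/text')]

print(by_comma)
print(by_line)
```

With this sample, the comma split keeps only the text up to "Tasty", while the line split returns the full review.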

Upvotes: 1
