Srikanth

Reputation: 247

How to capture and separate text fields using regex in Python

I am trying to build a dataframe from a dataset stored as a text file. The file has the following format:

product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!

product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others and this is a ten compared to them. Real easy to use and sync's easily. Definite recommended buy to transfer data to and from your Cell.

So I need to generate a dataframe that has ProductID, Title, Price, etc. as column titles, with the corresponding data in each record.

The final dataframe I want is:

ID          Title                        Price      UserID          ProfileName     Helpfulness     Score   Time        summary
B000JVER7W  Mobile Action MA730          unknown    A1RXYH9ROBAKEZ  A. Igoe         0/0             1.0     1233360000  Don't buy!
            Handset Manager - Bluetooth 
            Data Suite

and so on for all the review details in the dataset, using regex. As I am a beginner with regex, I have not been able to do this. I tried the following (assuming the variable dataset holds the full contents of the text file):

pattern = "product\productId:\s(.*)\s"
a = re.search(pattern, dataset)

By doing this I get the output:

>> a.group(1)
 "B000JVER7W product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite product/price: unknown review/userId: A1RXYH9ROBAKEZ review/profileName: A. Igoe review/helpfulness: 0/0 review/score: 1.0 review/time: 1233360000 review/summary: Dont buy! review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!"

But what I want is

>> a.group(1)
"["B000JVER7W", "A000123js" ...]"

and similarly for all the fields.

Is this possible, and if so, how can I do it?

Thanks in advance

Upvotes: 1

Views: 62

Answers (2)

Adarsh Pai

Reputation: 26

You can do it without any regex by building a dictionary of lists and passing it to pandas.DataFrame().

Try this:

import pandas as pd

with open("your_file_name") as file:
    # Records are separated by a blank line; strip() avoids an empty trailing record
    product_details = file.read().strip().split("\n\n")

product_dict = {"ID":[],"Title":[],"Price":[],"UserID":[],
                "ProfileName":[],"Helpfulness":[],"Score":[],"Time":[],"summary":[]}

for product in product_details:
    fields = product.split("\n")
    # Split on the first ":" only, so values that contain ":" stay intact,
    # and strip the space that follows the colon
    values = [field.split(":", 1)[1].strip() for field in fields]
    # Columns follow the order of the lines in the file; zip stops after
    # the ninth field, so review/text is left out as in the desired output
    for column, value in zip(product_dict, values):
        product_dict[column].append(value)

dataframe = pd.DataFrame(product_dict)
print(dataframe)

Output

The first row would look like this, as you wanted:

ID          Title                        Price      UserID          ProfileName     Helpfulness     Score   Time        summary
B000JVER7W  Mobile Action MA730          unknown    A1RXYH9ROBAKEZ  A. Igoe         0/0             1.0     1233360000  Don't buy!
            Handset Manager - Bluetooth 
            Data Suite
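
The same idea can be made independent of the field order by reading the key from each line. This is only a minimal sketch: it assumes every line has the form "key: value", that records are separated by one blank line, and that no value spans several lines. The function name parse_records and the shortened inline sample (including its second summary) are hypothetical, for illustration only.

```python
def parse_records(raw):
    """Parse blank-line-separated 'key: value' blocks into a list of dicts."""
    records = []
    for block in raw.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            # partition splits on the first ": " only, so values that
            # contain a colon are kept whole
            key, sep, value = line.partition(": ")
            if sep:  # skip lines that do not look like "key: value"
                # "product/productId" -> "productId"
                record[key.split("/")[-1]] = value.strip()
        records.append(record)
    return records


# Hypothetical, shortened sample in the layout shown in the question
sample = """product/productId: B000JVER7W
product/price: unknown
review/summary: Don't buy!

product/productId: B000JVER7W
product/price: unknown
review/summary: Note: works fine"""

records = parse_records(sample)
print(records[0]["productId"])
print(records[1]["summary"])
```

A list of dicts like this can be passed to pd.DataFrame(records), which uses the dict keys as column names.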

Upvotes: 1

kantal

Reputation: 2407

You have a typo in 'pattern': change the '\' to '/'. Also use \s* instead of \s, and use findall instead of search so that you get every match rather than only the first:

import re

pattern = r"product/productId:\s*(.*)\s*"
mo = re.findall(pattern, text)  # text holds the file contents; one ID per record
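
To get a list like this for every field, the same findall call can be wrapped in a small helper. The following is a rough sketch, not a definitive solution: the helper name extract is hypothetical, the sample is a shortened copy of the data from the question, and it assumes each field sits on its own line.

```python
import re

# Shortened copy of the two records from the question
text = """product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
review/userId: A1RXYH9ROBAKEZ

product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
review/userId: A7L6E1KSJTAJ6"""


def extract(field, text):
    """Return the value of `field` from every record, in file order."""
    # re.MULTILINE makes ^ and $ match at every line; "." does not match
    # a newline, so (.*) stops at the end of the line
    pattern = r"^" + re.escape(field) + r":[ \t]*(.*)$"
    return re.findall(pattern, text, flags=re.MULTILINE)


columns = {
    "ID": extract("product/productId", text),
    "UserID": extract("review/userId", text),
}
print(columns["ID"])      # ['B000JVER7W', 'B000JVER7W']
print(columns["UserID"])  # ['A1RXYH9ROBAKEZ', 'A7L6E1KSJTAJ6']
```

As long as every record contains every field, the lists have equal length and the dict can be passed to pd.DataFrame(columns).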

Upvotes: 0
