Reputation: 247
I am trying to build a dataframe from a dataset stored as a text file. The file has the following format:
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others and this is a ten compared to them. Real easy to use and sync's easily. Definite recommended buy to transfer data to and from your Cell.
So I need to generate a dataframe that has ProductID, Title, Price, etc. as column titles and the corresponding data in each record.
The final dataframe I want is:
ID          Title                                                       Price    UserID          ProfileName  Helpfulness  Score  Time        summary
B000JVER7W  Mobile Action MA730 Handset Manager - Bluetooth Data Suite  unknown  A1RXYH9ROBAKEZ  A. Igoe      0/0          1.0    1233360000  Don't buy!
and so on for all the review details in the dataset, using regex. As I am a beginner with regex, I am unable to perform this operation. I tried the following (assuming the dataset variable holds the entire contents of the text file):
pattern = "product\productId:\s(.*)\s"
a = re.search(pattern, dataset)
By doing this I get the output
>> a.group(1)
"B000JVER7W product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite product/price: unknown review/userId: A1RXYH9ROBAKEZ review/profileName: A. Igoe review/helpfulness: 0/0 review/score: 1.0 review/time: 1233360000 review/summary: Dont buy! review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!"
But what I want is
>> a.group(1)
"["B000JVER7W", "A000123js" ...]"
and similarly for all the fields.
Is the above requirement possible? If so, how can I do it?
Thanks in advance.
Upvotes: 1
Views: 62
Reputation: 26
You can do this without any regex: build a dictionary of lists and pass it to pandas.DataFrame().
Try this:
import pandas as pd

# Read the whole file and split it into records
# (this assumes records are separated by a blank line)
with open("your_file_name") as file:
    product_details = file.read().strip().split("\n\n")

product_dict = {"ID":[],"Title":[],"Price":[],"UserID":[],
                "ProfileName":[],"Helpfulness":[],"Score":[],"Time":[],"summary":[]}

for product in product_details:
    fields = product.split("\n")
    # split(":", 1) cuts at the first colon only, so values containing ":" stay whole;
    # strip() drops the space that follows the colon
    product_dict["ID"].append(fields[0].split(":", 1)[1].strip())
    product_dict["Title"].append(fields[1].split(":", 1)[1].strip())
    product_dict["Price"].append(fields[2].split(":", 1)[1].strip())
    product_dict["UserID"].append(fields[3].split(":", 1)[1].strip())
    product_dict["ProfileName"].append(fields[4].split(":", 1)[1].strip())
    product_dict["Helpfulness"].append(fields[5].split(":", 1)[1].strip())
    product_dict["Score"].append(fields[6].split(":", 1)[1].strip())
    product_dict["Time"].append(fields[7].split(":", 1)[1].strip())
    product_dict["summary"].append(fields[8].split(":", 1)[1].strip())

dataframe = pd.DataFrame(product_dict)
print(dataframe)
Output
The first row would look like this, as you wanted:
ID          Title                                                       Price    UserID          ProfileName  Helpfulness  Score  Time        summary
B000JVER7W  Mobile Action MA730 Handset Manager - Bluetooth Data Suite  unknown  A1RXYH9ROBAKEZ  A. Igoe      0/0          1.0    1233360000  Don't buy!
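If the fields are not always in the same order, or the records are not separated by a blank line, reading each line by its key may be more forgiving than indexing by position. The following is only a minimal sketch of that idea: parse_records is a hypothetical helper name, the sample text is a shortened copy of the data in the question, and the code assumes that every record starts with a product/productId line and that every field fits on a single line.

```python
# Shortened copy of the sample data from the question (review/text omitted).
SAMPLE = """product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
"""


def parse_records(text):
    """Turn 'key: value' lines into a list of dicts, one dict per review.

    Hypothetical helper; assumes 'product/productId' opens every record.
    """
    records = []
    for line in text.splitlines():
        # partition() splits at the first colon only, so colons in the value survive
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line, or a line that is not in 'key: value' form
        key = key.strip()
        if key == "product/productId" or not records:
            records.append({})  # start a new record
        records[-1][key] = value.strip()
    return records


records = parse_records(SAMPLE)
print(len(records), records[0]["review/summary"])
```

Passing the resulting list to pd.DataFrame(records) should give one row per review, with the keys as column names; the columns can then be renamed to ID, Title, and so on if desired.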
Upvotes: 1
Reputation: 2407
You have a typo in 'pattern': change '\' to '/'. Also use \s* and findall, which returns every match instead of only the first one:
pattern = r"product/productId:\s*(.*)\s*"
mo = re.findall(pattern, text)
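To show how findall can cover every field rather than just the product ID, here is a rough sketch. It assumes the text still contains its newline characters (by default `.` does not match a newline, so `(.*)` stops at the end of the line), and it uses re.MULTILINE so that `^` and `$` match at each line. The collect_columns name and the shortened sample text are illustrative only and not part of the original code.

```python
import re

# Shortened copy of the sample data from the question.
text = """product/productId: B000JVER7W
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/score: 1.0
review/summary: Don't buy!
product/productId: B000JVER7W
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/score: 5.0
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
"""

# One field: every product ID in the file, in order of appearance.
ids = re.findall(r"^product/productId:[ \t]*(.*)$", text, flags=re.MULTILINE)


def collect_columns(text):
    """Group all 'prefix/name: value' lines into {key: [values...]}.

    Hypothetical helper; assumes each field occupies exactly one line.
    """
    columns = {}
    pairs = re.findall(r"^(\w+/\w+):[ \t]*(.*)$", text, flags=re.MULTILINE)
    for key, value in pairs:
        columns.setdefault(key, []).append(value.strip())
    return columns


columns = collect_columns(text)
print(ids)
print(columns["review/score"])
```

If every key occurs once per record, the lists all have the same length and pd.DataFrame(columns) can build the dataframe directly; if some records lack a field, the lists would differ in length and this shortcut would not apply.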
Upvotes: 0