Christopher Flach

Reputation: 85

Python Read Specific Data from Text File

I'm struggling to grasp this. I need to create a pandas DataFrame object with the following entries for each review:

If anyone can even just help me get started on how to print every product/productId line, that would be appreciated.

Here's a sample of my text file: (sorry I don't know how to properly format it when I enter it on this site)

product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.

product/productId: B00813GRG4
review/userId: A1D87F6ZCVE5NK
review/profileName: dll pa
review/helpfulness: 0/0
review/score: 1.0
review/time: 1346976000
review/summary: Not as Advertised
review/text: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".

product/productId: B000LQOCH0
review/userId: ABXLMWJIXXAIN
review/profileName: Natalia Corres "Natalia Corres"
review/helpfulness: 1/1
review/score: 4.0
review/time: 1219017600
review/summary: "Delight" says it all
review/text: This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.

Upvotes: 0

Views: 1535

Answers (2)

wwii

Reputation: 23753

Here is a start. I couldn't decipher a couple of your fields/columns, so it might need more logic and text massaging. Similar to other answers: parse the text into dictionary key:value pairs, using a regular expression to find the pairs.

import collections
import re

import pandas as pd

# Map the field names in the file to the DataFrame column names
fields = {'productId':'Product ID', 'score':'Rating',
          'helpfulness':'Number Voting', 'text':'Review'}

# Group 1 is the field name after the "/", group 2 is the rest of the line
pattern = r'/([^:]*):\s?(.*)'
kv = re.compile(pattern)

data = collections.defaultdict(list)
with open('file.txt') as f:
    reviews = f.read()

for match in kv.finditer(reviews):
    key, value = match.groups()
    if key in fields:
        data[fields[key]].append(value)

df = pd.DataFrame.from_dict(data)
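
To see what the regular expression approach collects without needing the file, here is a minimal sketch that runs on an in-memory string. The `sample` text is a shortened, made-up stand-in for `file.txt`, and the pattern is a stricter variant I am assuming for illustration (anchored to the start of each line), not the exact pattern from the answer:

```python
import collections
import re

# Hypothetical sample standing in for the contents of file.txt
sample = """product/productId: B001E4KFG0
review/helpfulness: 1/1
review/score: 5.0
review/text: Good quality: it smells better.

product/productId: B00813GRG4
review/helpfulness: 0/0
review/score: 1.0
review/text: Not as advertised.
"""

fields = {'productId': 'Product ID', 'score': 'Rating',
          'helpfulness': 'Number Voting', 'text': 'Review'}

# Assumed variant of the pattern: anchoring at the start of each line means
# a "/" or ":" inside a review body is not mistaken for a new field.
kv = re.compile(r'^\w+/(\w+):[ \t]?(.*)$', re.MULTILINE)

data = collections.defaultdict(list)
for match in kv.finditer(sample):
    key, value = match.groups()
    if key in fields:
        data[fields[key]].append(value)

print(data['Product ID'])
print(data['Review'])
```

Each list in `data` has one entry per review, which is the column-oriented shape that `pd.DataFrame.from_dict` expects.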

Upvotes: 0

Joabe da Luz

Reputation: 1020

If I understood your question correctly, you want to read from a file with the structure you posted. You can use the following code, which builds a list in which every review is a dictionary:

# Open the file and read every line; the with block closes it afterwards
with open('file.txt') as your_file:
    reviews = your_file.readlines()

reviews_array = []
dictionary = {}

# Go through every line; a blank line marks the end of a review
for review in reviews:
    # Split on the first ":" only, so colons inside the review text are kept
    this_line = review.split(":", 1)
    if len(this_line) > 1:
        # The part before the ":" is the key of the dictionary, and the rest is the content
        dictionary[this_line[0]] = this_line[1].strip()
    elif dictionary:
        # A blank line was found, so save the finished review and reset
        # for the next one
        reviews_array.append(dictionary)
        dictionary = {}

# Append the last review if the file does not end with a blank line
if dictionary:
    reviews_array.append(dictionary)

print(reviews_array)

This code will print something like this:

[
{'review/text': 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.', 'review/profileName': 'delmartian', 'review/summary': 'Good Quality Dog Food', 'product/productId': 'B001E4KFG0', 'review/score': '5.0', 'review/time': '1303862400', 'review/helpfulness': '1/1', 'review/userId': 'A3SGXH7AUHU8GW'},
{'review/text': 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".', 'review/profileName': 'dll pa', 'review/summary': 'Not as Advertised', 'product/productId': 'B00813GRG4', 'review/score': '1.0', 'review/time': '1346976000', 'review/helpfulness': '0/0', 'review/userId': 'A1D87F6ZCVE5NK'},
{'review/text': 'bla blas', 'review/profileName': 'Natalia Corres "Natalia Corres"', 'review/summary': '"Delight" says it all', 'product/productId': 'B000LQOCH0', 'review/score': '4.0', 'review/time': '1219017600', 'review/helpfulness': '1/1', 'review/userId': 'ABXLMWJIXXAIN'}
]

You can access every object like this:

for r in reviews_array:
    print(r['review/userId'])

And then you will have this result:

A3SGXH7AUHU8GW
A1D87F6ZCVE5NK
ABXLMWJIXXAIN
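
Since the question asks for a pandas DataFrame, a list of dictionaries like this can be passed straight to `pd.DataFrame`. This is only a sketch: the `reviews_array` below is a shortened stand-in for the real list, the `helpful_votes` and `total_votes` column names are made up for illustration, and reading `1/1` as "helpful votes / total votes" is my assumption about the data:

```python
import pandas as pd

# Shortened stand-in for the reviews_array built from the file
reviews_array = [
    {'product/productId': 'B001E4KFG0', 'review/helpfulness': '1/1',
     'review/score': '5.0', 'review/time': '1303862400'},
    {'product/productId': 'B00813GRG4', 'review/helpfulness': '0/0',
     'review/score': '1.0', 'review/time': '1346976000'},
]

# One row per review; the dictionary keys become the column names
df = pd.DataFrame(reviews_array)

# Every value was read as text, so convert the numeric columns explicitly
df['review/score'] = df['review/score'].astype(float)
df['review/time'] = pd.to_datetime(df['review/time'].astype(int), unit='s')

# Assumption: "a/b" means a helpful votes out of b total votes
votes = df['review/helpfulness'].str.split('/', expand=True).astype(int)
df['helpful_votes'] = votes[0]
df['total_votes'] = votes[1]

print(df[['product/productId', 'review/score', 'total_votes']])
```

From here the columns can be renamed or selected to match whatever entries the DataFrame needs.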

Upvotes: 2
