Reputation: 85
I am new to python, and am wondering if anyone can help me with some file loading.
Situation is I have some text files and i'm trying to do sentiment analysis. Here's the text file. It is split into three category: <department>
, <user>
, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to make into this
<category> <user> <review>
I have 50k lines of these data.
I have tried to load directly into numpy, but it says its an empty separator error. I looked up stackoverflow, but i couldn't find a situation where it applies to different number of delimiters. For instance, i will never get to know how many spaces are there in the data set that i have.
My biggest problem is, how do you count the number of delimiters and give them column. Is there a way that I can make into three categories <department>, <user>, <review>.
Bear in mind that the review data can contain random commas and spaces which i can't control. So the system must be smart enough to pick up!
Any ideas? Is there a way that i can tell python that after you read the user data, then everything behind falls under review
?
Upvotes: 3
Views: 94
Reputation: 32522
With data like this I'd just use split()
with the maxplit
argument:
If
maxsplit
is given, at mostmaxsplit
splits are done (thus, the list will have at mostmaxsplit+1
elements).
Example:
from StringIO import StringIO
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
category, user, review = line.split(None, 2)
print ("category: {} - user: {} - review: '{}'".format(category,
user,
review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
Upvotes: 5
Reputation: 6295
What about doing it sorta manually:
data = []
for line in input_data:
tmp_split = line.split(" ")
#Get the first part (dept)
dept = tmp_split[0]
#get the 2nd part
user = tmp_split[1]
#everything after is the review - put spaces inbetween each piece
review = " ".join(tmp_split[2:])
data.append([dept, user, review])
Upvotes: 1