Parsing txt file in python where it is hard to split by delimiter

Question

I am new to Python, and am wondering if anyone can help me with some file loading.

I have some text files and I'm trying to do sentiment analysis. Each line of the text file is split into three fields: category, user, review.

Here are some sample data:

men peter123 the pants are too tight for my liking! 
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills 
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working

I want to split each line into separate columns.

I have 50k lines of this data.

I tried to load the file directly into numpy, but it fails with an empty separator error. I searched Stack Overflow, but I couldn't find a case that covers a varying number of delimiters. For instance, I can never know in advance how many spaces there are in a given line of my data set.

My biggest problem is: how do you count the delimiters and assign the pieces to columns? Is there a way to split each line into three fields: category, user, review? Bear in mind that the review text can contain arbitrary commas and spaces which I can't control, so the parser must be smart enough to cope with that.

Any ideas? Is there a way to tell Python that, once it has read the user field, everything after it belongs to the review?

moooeeeep · Accepted Answer

With data like this I'd just use split() with the maxsplit argument:

If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
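
To see what that means for lines with an unknown number of spaces, here is a small sketch (the sample line is made up for illustration). With the separator left as None, a run of whitespace counts as a single delimiter, and maxsplit=2 stops after two splits so the rest of the line stays in one piece, commas and all:

```python
# Hypothetical sample line with irregular spacing between the fields.
line = "men   peter123  the pants are too tight, really!"

# sep=None: any run of whitespace is one delimiter.
# maxsplit=2: stop after two splits, so the review text is left intact.
parts = line.split(None, 2)
print(parts)
```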

Example:

from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
s = StringIO("""men peter123 the pants are too tight for my liking! 
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills 
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")

for line in s:
    category, user, review = line.split(None, 2)
    print ("category: {} - user: {} - review: '{}'".format(category,
                                                           user,
                                                           review.strip()))

The output is:

category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
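
For the real 50k-line file, the same idea could be wrapped in a small generator. This is only a sketch: the function name parse_reviews and the file name reviews.txt are placeholders, and skipping lines that have fewer than three fields is an assumed policy rather than something the question specifies.

```python
def parse_reviews(lines):
    """Yield (category, user, review) tuples from an iterable of text lines."""
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 3:
            # Blank line, or a line with no review text: skip it (assumed policy).
            continue
        category, user, review = parts
        yield category, user, review.strip()


# Small in-memory demo; the second and third lines are deliberately malformed.
sample = [
    "men peter123 the pants are too tight for my liking! \n",
    "\n",
    "kids georgel\n",
]
rows = list(parse_reviews(sample))
print(rows)

# With a file on disk (placeholder name), the call would look like:
# with open("reviews.txt") as f:
#     rows = list(parse_reviews(f))
```

The resulting list of tuples can then be handed to whatever does the analysis, since each row already has exactly three fields.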

For reference:

https://docs.python.org/2/library/stdtypes.html#str.split
