Reputation: 191
I am trying to read a file with the format shown below. There are two '\n' characters between every line.
Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http
Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!
I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...
I am using the Python code below to read the lines and convert them into a DataFrame:
import pandas as pd

open_reviews = open("C:\\Downloads\\review_short.txt", "r", encoding="Latin-1").read()
documents = []
for r in open_reviews.split('\n\n'):
    documents.append(r)
df = pd.DataFrame(documents)
print(df.head())
The output I am getting is as below:
0 I was very inspired by Louise's Hay approach t...
1 \n You Can Heal Your Life by
2 \n I had an older version
3 \n I love Louise Hay and
4 \n I thought the book was exellent
Since I split on two newlines ('\n\n'), a '\n' is left at the beginning of each line. Is there another way to handle this so that I get the output below?
0 I was very inspired by Louise's Hay approach t...
1 You Can Heal Your Life by
2 I had an older version
3 I love Louise Hay and
4 I thought the book was exellent
Upvotes: 0
Views: 1535
Reputation: 109546
This appends every non-blank line.
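filename = "..."
lines = []
with open(filename) as f:
    for line in f:
        line = line.strip()  # drop surrounding whitespace, including the trailing '\n'
        if line:  # skip lines that are empty after stripping
            lines.append(line)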
>>> lines
['Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http',
'Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!',
'I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...']
lines = pd.DataFrame(lines, columns=['my_text'])
>>> lines
my_text
0 Great tool for healing your life--if you are r...
1 Bought this book for a friend. I read it years...
2 I read this book many years ago and have heard...
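To try the same idea without the original file, here is a minimal sketch that runs on an in-memory sample. The helper name `read_nonblank_lines` and the sample text are made up for illustration; they are not part of the question or of any library.

```python
import io


def read_nonblank_lines(handle):
    """Return the stripped, non-empty lines of an open text file-like object."""
    # Hypothetical helper: same logic as the loop above, written as a comprehension.
    return [line.strip() for line in handle if line.strip()]


# Made-up sample standing in for the review file, with blank lines between reviews.
sample = io.StringIO(
    "First review.\n"
    "\n"
    " Second review.\n"
    "\n"
    "\n"
    "Third review.\n"
)

lines = read_nonblank_lines(sample)
print(lines)  # ['First review.', 'Second review.', 'Third review.']
```

Because the helper accepts any iterable of lines, the object returned by `open(filename)` should work in place of the `StringIO` sample, and the resulting list can be passed to `pd.DataFrame` as shown above.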
Upvotes: 2
Reputation: 4987
Use readlines() and clean each line with strip().
filename = "C:\\Downloads\\review_short.txt"
documents = []
with open(filename, "r", encoding="Latin-1") as open_reviews:
    for r in open_reviews.readlines():
        r = r.strip()  # remove surrounding spaces and the trailing \n
        if r:  # skip blank lines
            documents.append(r)
Upvotes: 0
Reputation: 1329
Try using the .strip() method. It removes whitespace characters, including '\n', from the beginning and end of a string.
You can use it like this:
for r in open_reviews.split('\n\n'):
    documents.append(r.strip())
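One thing to watch for: if the file sometimes has more than two newlines in a row, splitting on '\n\n' can produce empty chunks, and strip() alone leaves them in the list as empty strings. A small sketch that also filters those out is below; the function name `split_reviews` and the sample text are hypothetical.

```python
def split_reviews(text):
    """Split text on blank lines, strip each chunk, and drop empty chunks."""
    # Hypothetical helper: strip every chunk first, then keep only non-empty ones.
    chunks = (chunk.strip() for chunk in text.split("\n\n"))
    return [chunk for chunk in chunks if chunk]


# Made-up sample with uneven runs of newlines between reviews.
sample_text = "First review.\n\n\n Second review.\n\n\n\nThird review.\n"

documents = split_reviews(sample_text)
print(documents)  # ['First review.', 'Second review.', 'Third review.']
```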
Upvotes: 0