Reputation: 227
I'm trying to write a custom tokenizer:
import re  # needed for re.sub
print(re.sub(' ', '\n', re.sub(r'[{}\[\]\\/"\',=():|\-*!;<>?]|//@', ' ', text)))  # text holds the input string
Output:
America


Category
States
of
the
United
States


Category
Southern
United
States


Link
FA
mk
Many new lines are being inserted: wherever two or more delimiters occur in a row, the first substitution produces consecutive spaces, and the second turns each of those into its own newline, leaving empty lines. I'm trying to write optimized code that removes all of these empty lines with regular expressions, without having to inspect every line individually. I'm really worried about the performance of the program: I have over 100 billion lines, so execution time matters. Any suggestions?
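For reference, this is the kind of empty-line removal I have in mind (a minimal sketch; the sample string is a made-up stand-in for the real output above):
import re
cleaned = "America\n\n\nCategory\nStates"  # stand-in for the newline-separated output
print(re.sub(r'\n+', '\n', cleaned).strip())  # collapse runs of newlines, trim the ends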
This is the output I'm trying to produce:
America
Category
States
of
the
United
States
Category
Southern
United
States
Link
FA
mk
Upvotes: 1
Views: 83
Reputation: 34146
You can use the join() and split() methods:
print(" ".join(your_string.split()))
Output:
America Category States of the United States Category Southern United States Link FA mk
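split() with no arguments splits on any run of whitespace (spaces and newlines alike) and drops the empty strings, which is exactly what removes the blank lines. A quick illustration with a made-up sample:
s = "America\n\n\nCategory   States"
print(s.split())
Output:
['America', 'Category', 'States']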
Edit:
To get each word on a separate line, use "\n" instead of " ":
print("\n".join(your_string.split()))
Upvotes: 4