Reputation: 133
Hi there. I am not sure how to explain my problem. Currently I have some text spread over several lines, as shown below:
picture gallery
see also
adaptation
ecology
extreme environment clothing
extremophile
lexen life in extreme environments
natural environment
references
"extreme environment" microbial life np nd web 16 may 2013
feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms
history
My question is: how do I turn it into a single line, like this, for example:
picture gallery see also adaptation ecology extreme environment clothing extremophile lexen life in extreme environments natural environment references "extreme environment" microbial life np nd web 16 may 2013 feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms
I am not sure what this is called, but so far the only solution I have found removes all whitespace, which is not what I need.
Please help me.
Thank you.
Upvotes: 2
Views: 77
Reputation: 3335
Please note that the expected result given in the question is quite creative at the right margin: it drops history from the input data ;-) Update: the latest comment indicates that the data comes from a file, so I have updated the answer.
Taking this as a small unwanted glitch, I suggest using neither regular expressions nor replace. Simply do the strip-split-join transformation in one go, like so (assuming the text is in the file in.txt in the folder where you invoke the script):
#! /usr/bin/env python
with open('in.txt', 'rt') as f:
    filtered = ' '.join(f.read().strip().split())
Or - if already in variable (and with expectation and comparison as minimal test):
#! /usr/bin/env python
text = '''picture gallery
see also
adaptation
ecology
extreme environment clothing
extremophile
lexen life in extreme environments
natural environment
references
"extreme environment" microbial life np nd web 16 may 2013
feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms
history
'''
expected = (
'picture gallery see also adaptation ecology extreme environment'
' clothing extremophile lexen life in extreme environments'
' natural environment references "extreme environment" microbial'
' life np nd web 16 may 2013 feminism and gis refers to the use'
' of geographic information system gis for feminist research and'
' also how women influence gis at technological stages feminist'
' gis research is aware of power differences in social and'
' economic realms history')
filtered = ' '.join(text.strip().split())
assert filtered == expected
And in case you need a newline at the end of that "one line" result, you could write instead:
filtered = '%s\n' % (' '.join(text.strip().split()),)
or
filtered = ' '.join(text.strip().split()) + '\n'
In that case of course the assert or expected variable should be changed in sync ;-)
This should also be a logically clear solution. Regular expressions are often tempting, but when the result is achievable with a simple split-join pipeline like this one, they add runtime overhead (and embed another language in your code).
Just measure with the setup above (passed as the string setup) and an adapted one (setup_re) for the regex:
import timeit
print('strip-split-join: ', ['%0.4f' % round(z, 4) for z in timeit.Timer("filtered = ' '.join(text.strip().split())", setup=setup).repeat(7, 1000)])
print(r're.sub("\s+", " "):', ['%0.4f' % round(z, 4) for z in timeit.Timer(r"filtered = replaced = re.sub('\s+', ' ', text)", setup=setup_re).repeat(7, 1000)])
this gives (on my machine):
strip-split-join: ['0.0043', '0.0045', '0.0047', '0.0046', '0.0043', '0.0040', '0.0045']
re.sub("\s+", " "): ['0.0265', '0.0254', '0.0246', '0.0248', '0.0238', '0.0255', '0.0266']
So, in these measurements, the regex solution is slower by a factor of roughly 5 to 6.
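For anyone who wants to repeat the measurement, here is a minimal self-contained sketch. It assumes Python 3, uses a short made-up sample instead of the question's text, and the helper names split_join and regex_sub are mine. The regex variant also gets a .strip() so that both functions return the same string, which the one-liner timed above did not do. Absolute timings will differ per machine.

```python
import re
import timeit

# Hypothetical sample text; the measurement above used the question's text.
text = "picture gallery\n\nsee also\n\nadaptation\n\necology\n\nhistory\n" * 20


def split_join(s):
    """Collapse every run of whitespace into a single space."""
    # split() without arguments also discards leading/trailing whitespace.
    return ' '.join(s.split())


def regex_sub(s):
    """Same result via a regular expression, trimmed to match split_join."""
    return re.sub(r'\s+', ' ', s).strip()


# Both approaches should agree before their speed is compared.
assert split_join(text) == regex_sub(text)

# 7 repetitions of 1000 calls each, mirroring repeat(7, 1000) above.
split_times = timeit.repeat(lambda: split_join(text), repeat=7, number=1000)
regex_times = timeit.repeat(lambda: regex_sub(text), repeat=7, number=1000)

print('split-join:', ['%0.4f' % t for t in split_times])
print('re.sub:    ', ['%0.4f' % t for t in regex_times])
```

The lambdas add a small, equal call overhead to both variants, so the ratio is the number to look at rather than the absolute values.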
Upvotes: 2
Reputation: 5115
If it's a file:
with open('path/to/your/file.txt') as f:
    text = f.read()
new_text = text.replace('\n', ' ')  # each newline becomes one space
print(new_text)  # this will have no newlines
with open('output.txt', 'w') as out:
    out.write(new_text)  # this will write it to a file
You could also use regex, like PJSCopeland said:
import re
s = "Example String \n more example string"
replaced = re.sub(r'\s+', ' ', s)
print(replaced)
Dilettant's solution is concise, correct and also faster than using regex (by my measure), so I recommend that as a best solution:
filtered = ' '.join(text.strip().split())
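To see why the two non-regex approaches are not equivalent, here is a small sketch on a made-up sample (the variable names are only for illustration). It assumes the input may contain blank lines or indented lines, which the question's data might or might not have.

```python
# Hypothetical sample with blank lines, an indented line, and a trailing newline.
sample = "picture gallery\n\nsee also\n\n  adaptation\n"

# replace() swaps each newline for one space, so blank lines and the
# trailing newline leave doubled and trailing spaces behind.
replaced = sample.replace('\n', ' ')

# split() with no argument splits on any run of whitespace and drops
# empty strings, so join() puts exactly one space between words.
joined = ' '.join(sample.split())

print(repr(replaced))
print(repr(joined))
```

If the input is known to have exactly one newline between lines and no trailing newline, both give the same result; otherwise only the split-join version yields single spaces throughout.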
Upvotes: 4
Reputation: 3006
Replace /\s+/g (every instance of at least one white-space character) with " ". (I'm not familiar with Python, unfortunately, so I don't know what the method call would be.)
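For reference, a rough Python counterpart of that replacement could look like the sketch below. The function name collapse_whitespace is made up for illustration, and the trailing .strip() is an addition of mine, since a plain substitution leaves a space where the final newline was.

```python
import re


def collapse_whitespace(text):
    """Hypothetical helper: turn each run of whitespace into one space."""
    # re.sub replaces every match by default, which corresponds to the
    # /g flag; strip() then removes a leading or trailing space.
    return re.sub(r'\s+', ' ', text).strip()


print(collapse_whitespace("picture gallery\n\nsee also\n"))
```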
Upvotes: 1