windboy

Reputation: 133

Remove paragraph breaks and save everything on one line

Hi there. I am not sure how to explain my problem. Currently I have some text as shown below:

 picture gallery 
    see also 
    adaptation
    ecology
    extreme environment clothing
    extremophile
    lexen life in extreme environments
    natural environment
    references 
    "extreme environment" microbial life np nd web 16 may 2013
    feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms
     history 

My question is: how do I transform it so that I get a result like this, for example:

picture gallery see also adaptation ecology extreme environment clothing extremophile lexen life in extreme environments natural environment references "extreme environment" microbial life np nd web 16 may 2013 feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms

I am not sure what this is called, but the only solution I have found so far removes all blank spaces, which is not what I need.

Please help me.

Thank you.

Upvotes: 2

Views: 77

Answers (3)

Dilettant

Reputation: 3335

Please note that the expected result given in the question is quite creative at the right margin: it dropped history from the input data ;-) Update: the latest comment indicates that the data comes from a file, so I have updated the answer.

Taking this as a small unwanted glitch, I suggest using neither regular expressions nor replace. Simply do the strip-split-join transformation in one go like so (assuming the text is in the file in.txt in the folder where you invoke the script):

#! /usr/bin/env python

with open('in.txt', 'rt') as f:
    filtered = ' '.join(f.read().strip().split())

Or, if the text is already in a variable (with an expected value and a comparison as a minimal test):

#! /usr/bin/env python

text = '''picture gallery 
    see also 
    adaptation
    ecology
    extreme environment clothing
    extremophile
    lexen life in extreme environments
    natural environment
    references 
    "extreme environment" microbial life np nd web 16 may 2013
    feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms
     history 
'''

expected = (
    'picture gallery see also adaptation ecology extreme environment'
    ' clothing extremophile lexen life in extreme environments'
    ' natural environment references "extreme environment" microbial'
    ' life np nd web 16 may 2013 feminism and gis refers to the use'
    ' of geographic information system gis for feminist research and'
    ' also how women influence gis at technological stages feminist'
    ' gis research is aware of power differences in social and'
    ' economic realms history')

filtered = ' '.join(text.strip().split())

assert filtered == expected
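
To see why the split-join step is enough, here is a minimal sketch written for Python 3. The sample string is made up for illustration and is not the poster's data. It shows that split() with no argument splits on any run of whitespace (spaces, tabs, newlines) and never yields empty strings, whereas split(' ') does. As far as I can tell this also means the strip() call is harmless but optional.

```python
# Illustrative sample only: leading spaces, blank line, tab, trailing newline.
sample = "  picture gallery \n    see also \n\n    adaptation\t ecology  \n"

# No argument: split on any run of whitespace, no empty strings in the result.
words = sample.split()

# Explicit ' ': split on every single space, so empty strings and '\n' survive.
naive = sample.split(' ')

# Joining the clean word list with single spaces gives the one-line result.
one_line = ' '.join(words)

print(words)
print(one_line)
```

Joining naive instead of words would keep the newlines and the extra spaces, which is why the argument-less form matters here.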

And in case you need a newline at the end of that "one line" result, you could write instead:

filtered = '%s\n' % (' '.join(text.strip().split()),)

or

filtered = ' '.join(text.strip().split()) + '\n'

In that case of course the assert or expected variable should be changed in sync ;-)

This should also be a logically clear solution. Regular expressions are often tempting, but when the result is feasible with a simple split-join pipeline like this one, they add some runtime cost (and another embedded language).

Just measure with the setup above and an adapted one for the regex (setup and setup_re are the timeit setup strings, which define text and, for the regex case, also import re):

import timeit

print('strip-split-join:  ', ['%0.4f' % z for z in timeit.Timer("filtered = ' '.join(text.strip().split())", setup=setup).repeat(7, 1000)])
print('re.sub("\\s+", " "):', ['%0.4f' % z for z in timeit.Timer(r"filtered = re.sub(r'\s+', ' ', text)", setup=setup_re).repeat(7, 1000)])

this gives (on my machine):

strip-split-join:   ['0.0043', '0.0045', '0.0047', '0.0046', '0.0043', '0.0040', '0.0045']
re.sub("\s+", " "): ['0.0265', '0.0254', '0.0246', '0.0248', '0.0238', '0.0255', '0.0266']

So on this input the regex solution is slower by a factor of roughly 5 to 6.
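
If you want to repeat the measurement yourself, here is a self-contained sketch for Python 3. The function names via_split and via_regex and the shortened sample text are my own stand-ins, not the setup used for the numbers above, so expect different absolute timings.

```python
import re
import timeit

# Shortened stand-in for the text from the question.
text = """picture gallery 
    see also 
    adaptation
    ecology
    extreme environment clothing
     history 
"""

def via_split(s):
    # Split on runs of whitespace, rejoin with single spaces.
    return ' '.join(s.split())

def via_regex(s):
    # Replace each run of whitespace with one space, then trim the ends.
    return re.sub(r'\s+', ' ', s).strip()

# 7 repetitions of 1000 calls each, as in the measurement above.
split_times = timeit.repeat(lambda: via_split(text), repeat=7, number=1000)
regex_times = timeit.repeat(lambda: via_regex(text), repeat=7, number=1000)

print('split-join:', ['%0.4f' % t for t in split_times])
print('re.sub:    ', ['%0.4f' % t for t in regex_times])
```

Both functions return the same string for this input, so the comparison is between two ways of producing the same result.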

Upvotes: 2

rofls

Reputation: 5115

If it's a file:

with open('path/to/your/file.txt') as f:
    text = f.read()
new_text = text.replace('\n', ' ')  # newlines become spaces; indentation spaces remain
print(new_text)  # this will have no newlines
with open('output.txt', 'w') as out:
    out.write(new_text)  # this will write it to a file

You could also use a regex, as PJSCopeland said:

import re
s = "Example String \n more example string"
replaced = re.sub(r'\s+', ' ', s)  # collapse each run of whitespace into one space
print(replaced)

Dilettant's solution is concise, correct and also faster than using a regex (by my measurement), so I recommend it as the best solution:

filtered = ' '.join(text.strip().split())
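
For completeness, a minimal sketch of that one-liner applied file to file, written for Python 3. The helper name collapse_file is made up for illustration, and the demo writes a throwaway input file in a temporary folder so that it runs anywhere; swap in your own paths.

```python
import os
import tempfile

def collapse_file(src_path, dst_path):
    """Read src_path, collapse all whitespace runs, write one line to dst_path."""
    with open(src_path, 'r') as src:
        one_line = ' '.join(src.read().split())
    with open(dst_path, 'w') as dst:
        dst.write(one_line + '\n')  # single trailing newline ends the line
    return one_line

# Demo on a throwaway file (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'output.txt')
    with open(src, 'w') as f:
        f.write(' picture gallery \n    see also \n    adaptation\n')
    collapse_file(src, dst)
    with open(dst) as f:
        result = f.read()

print(repr(result))
```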

Upvotes: 4

PJSCopeland

Reputation: 3006

Replace /\s+/g (every instance of at least one white-space character) with " ". (I'm not familiar with Python, unfortunately, so I don't know what the method call would be.)
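
In Python, that replacement maps onto re.sub from the standard re module, which replaces every match by default, so no g flag is needed. A minimal sketch for Python 3 with a made-up sample string; note that, unlike split-join, it leaves one space at each end unless you strip the result.

```python
import re

# Illustrative sample, not the poster's full text.
text = ' picture gallery \n    see also \n    adaptation\n'

# Every run of one or more whitespace characters becomes a single space.
replaced = re.sub(r'\s+', ' ', text)

# Leading and trailing whitespace runs also became single spaces; trim them.
trimmed = replaced.strip()

print(repr(replaced))
print(repr(trimmed))
```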

Upvotes: 1
