aysha
aysha

Reputation: 213

Removing white space from txt with python

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:

B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        

I'd like to remove all extra spaces (they're actually different number of spaces, not tabs) in between the columns. I'd also then like to replace it with some delimiter (tab or pipe since there's commas within the data), like so:

ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:

Upvotes: 11

Views: 7928

Answers (6)

Ahasanul Haque
Ahasanul Haque

Reputation: 11144

What about this?

your_string ='ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS'
print re.sub(r'\s{2,}','|',your_string.strip())

Output:

ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Expanation:

I've used re.sub() which takes 3 parameter, a pattern, a string you want to replace with and the string you want to work on.

What I've done is taking at least two space together , I 've replaced them with a | and applied it on your string.

Upvotes: 7

Fruch
Fruch

Reputation: 468

Looks like this library can solve this quite nicely: http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery

Impressive...

Upvotes: 0

Jonathan Vanasco
Jonathan Vanasco

Reputation: 15680

It looks like your data is in a "text-table" format.

I recommend using the first row to figure out the start point and length of each column (either by hand or write a script with regex to determine the likely columns), then writing a script to iterate the rows of the file, slice the row into column segments, and apply strip to each segment.

If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.

In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.

Parsing the text above would look something like this (i nested a list comprehension with brackets instead of the traditional format so it's easier to understand):

cols = ((0,34),
        (34, 50),
        (50, 59),
        (59, None),
        )
for line in lines:
    cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
    print cleaned

then you can write it with something like:

import csv
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        spamwriter.writerow([line[col_start:col_end].strip()
                             for (col_start, col_end) in cols
                             ])

Upvotes: 3

VelikiiNehochuha
VelikiiNehochuha

Reputation: 4373

s = """B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
"""

# Update
re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
In [71]: print re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Upvotes: 5

Sнаđошƒаӽ
Sнаđошƒаӽ

Reputation: 17612

Considering there are at least two spaces separating the columns, you can use this:

lines = [
'B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  ',
'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
]

for line in lines:
    parts = []
    for part in line.split('  '):
        part = part.strip()
        if part:  # checking if stripped part is a non-empty string
            parts.append(part)
    print('|'.join(parts))

Output for your input:

B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Upvotes: 3

timgeb
timgeb

Reputation: 78750

You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.

>>> import re
>>> line = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
>>> re.sub('\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'

Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.

Your actual code should look similar to this:

import re
with open(filename) as f:
    for line in f:
        subbed = re.sub('\s{2,}', '|', line.strip())
        # do something here

Upvotes: 7

Related Questions