Ashutosh

Reputation: 426

Python Split Specific case handling

Some numbers in the lines below are not separated, even after calling split():

707 K   -7 -7 -6 -8 -2 -5 -8 -8  2 -5 -4 -7 -6  6 -8 -6 -7 -4  8 -6
708 L    0  0 -2 -3 -3  1  3 -3  0 -1 -3  4 -2 -5 -2  0  0 -3 -2  0
709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7
710 P   -1 -2  2  1 -2  1  2 -1  1 -4 -4  2 -3 -4  3  1  0 -1 -3 -4
711 E   -3 -3 -3  1 -6  1  5 -3  2  0 -1 -1 -1 -1 -5 -2 -1 -4  0  0
712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7
713 S   -4 -4  1  1 -5 -2 -1  6 -1 -8 -7 -3 -7 -4 -2 -1 -3 -4 -4 -7

This matrix comes from a text file and is very large (both in the number of lines per file and in the number of such files).

I read the lines in Python and split them up like this:

m = fp.readline()        # fp is the file object; this runs inside a loop

m = m.split()            # split the line on whitespace

m = map(int, m[2:22])    # convert the 20 score strings (indices 2 to 21) to integers

But this raises an error for rows 709 and 712 (the row number is at the far left of the matrix):

Traceback (most recent call last):
  File "<pyshell#150>", line 1, in <module>
    map(int,m[2:22].split())
ValueError: invalid literal for int() with base 10: '-7-10'

That's because '-7-10' was not split into '-7' and '-10' as expected, due to formatting errors in the files.

So the question is: how can I handle this formatting error so that I can split and process the integers the same way as on the other lines of the matrix? Keep in mind that this has to work on very long files, and on many of them, so fixing the formatting by hand is not feasible, even though a single file contains at most 100 such errors. Thank you.

Upvotes: 1

Views: 125

Answers (4)

Abhijit

Reputation: 63737

Since the data you are dealing with is in a fixed-width format, it is better to parse it with the struct module than with split:

>>> from struct import *
>>> line="712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7"
>>> format_string = "<3s2s2s"+"3s"*20
>>> unpack(format_string,line)
('712', ' C', '  ', ' -7', '-10', ' -9', '-10', ' 12', ' -9', '-10', ' -9', ' -9', ' -8', ' -8', ' -9', ' -8', ' -9', ' -9', ' -4', ' -7', ' -9', ' -9', ' -7')
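On Python 3, struct.unpack expects bytes rather than str, so the session above needs a small adaptation. The following is only a sketch, assuming the column layout inferred from the sample (a 3-character row number, a 2-character residue field, 2 padding characters, then twenty 3-character scores). The names ROW_FORMAT and parse_row_struct are hypothetical.

```python
import struct

# Assumed layout, inferred from the sample rows: 3-char row number,
# 2-char residue field, 2 padding chars, then 20 score fields of 3 chars each.
ROW_FORMAT = "<3s2s2s" + "3s" * 20


def parse_row_struct(line):
    """Return (row_number, residue, scores) for one fixed-width line."""
    # struct works on bytes in Python 3, so drop the newline and encode.
    raw = line.rstrip("\n").encode("ascii")
    fields = [f.decode("ascii") for f in struct.unpack(ROW_FORMAT, raw)]
    row_number = int(fields[0])
    residue = fields[1].strip()
    # fields[2] is padding; the remaining 20 fields are the scores.
    scores = [int(f) for f in fields[3:]]
    return row_number, residue, scores
```

Note that unpack raises struct.error unless the line is exactly 67 characters long once the newline is removed, so rows with trailing spaces or a different width would need extra handling.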

Upvotes: 0

Joachim Isaksson

Reputation: 180927

You could (as Simeon points out in the comments) just parse the numbers by fixed position. The fields in your sample seem to start at position 7 and to be 3 characters wide each, so this comprehension loops from 7 up to (but not including) 67 in steps of 3 and converts each substring to an integer:

>>> m
'709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7\n'

>>> a = [int(m[pos:pos+3]) for pos in range(7,67,3)]

>>> a
[-7, -10, -9, -10, 12, -9, -10, -9, -9, -8, -8, -9, -8, -9, -9, -5, -7, -9, -4, -7]
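Since the question involves many large files, the same slicing can be wrapped in a function and applied line by line. This is a sketch under the same assumptions as above (scores start at position 7, are 3 characters wide, and there are 20 per row). The names parse_scores and read_matrix and their parameters are hypothetical.

```python
def parse_scores(line, start=7, width=3, count=20):
    """Slice `count` fixed-width integer fields out of `line`."""
    return [int(line[pos:pos + width])
            for pos in range(start, start + width * count, width)]


def read_matrix(path):
    """Yield one list of scores per non-blank line of the file at `path`."""
    with open(path) as fp:
        # Iterating over the file object reads one line at a time,
        # so a large file is never held in memory as a whole.
        for line in fp:
            if line.strip():
                yield parse_scores(line)
```

int() ignores the leading spaces in a field such as ' -7', so no explicit strip is needed. A line shorter than expected produces an empty slice and a ValueError, which at least makes malformed rows visible.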

Upvotes: 2

svk

Reputation: 5919

You can use the re module to find all substrings matching a specific regex pattern:

>>> x = "-7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7"
>>> import re
>>> re.findall( r"-?[0-9]+", x )
['-7', '-10', '-9', '-10', '12', '-9', '-10', '-9', '-9', '-8', '-8', '-9', '-8', '-9', '-9', '-4', '-7', '-9', '-9', '-7']

Of course, if you might get 7, 123 formatted as 7123, then your only option is to split the string by indices rather than content patterns.
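To apply the pattern to a whole line from the file, the row number and residue letter have to be set aside first so that they are not picked up as scores. A possible sketch, assuming every line starts with those two whitespace-separated fields; parse_scores_regex and INT_PATTERN are hypothetical names.

```python
import re

# An optional minus sign followed by one or more digits.
INT_PATTERN = re.compile(r"-?[0-9]+")


def parse_scores_regex(line):
    """Extract the signed integers that follow the row number and residue."""
    # Split off the two leading fields; `rest` holds only the score columns.
    _row_number, _residue, rest = line.split(None, 2)
    return [int(token) for token in INT_PATTERN.findall(rest)]
```

As with the plain findall call, this cannot separate two adjacent positive numbers, so it only suits files where the collisions always involve a minus sign.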

Upvotes: 0

Kevin

Reputation: 76194

You could perform a replace on all negative signs to ensure that they're preceded by at least one space:

m = "-7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7"
print(m.replace("-", " -").split())

Result:

['-7', '-10', '-9', '-10', '12', '-9', '-10', '-9', '-9', '-8', '-8', '-9', '-8', '-9', '-9', '-5', '-7', '-9', '-4', '-7']

Of course, this only helps if it's a negative number that's butting up against its neighbor. If you have colliding values like:

707 K   -1 -2 -3
707 K   -4123 -6

Then you can't easily separate the -4 from the 123.
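One way to at least notice such rows is to count the tokens after the replace and refuse anything unexpected. This is a sketch only, assuming each row should carry 20 scores as in the sample; parse_scores_replace and EXPECTED_SCORES are hypothetical names.

```python
EXPECTED_SCORES = 20  # assumption: every row carries 20 scores, as in the sample


def parse_scores_replace(line, expected=EXPECTED_SCORES):
    """Split scores after padding every minus sign with a space."""
    # Set aside the row number and residue letter before touching the scores.
    _row_number, _residue, rest = line.split(None, 2)
    tokens = rest.replace("-", " -").split()
    # A collision like '-4123' leaves too few tokens, so flag it loudly.
    if len(tokens) != expected:
        raise ValueError("expected %d scores, got %d in line %r"
                         % (expected, len(tokens), line))
    return [int(token) for token in tokens]
```

With expected=3, the row "707 K   -1 -2 -3" parses normally, while "707 K   -4123 -6" raises ValueError because only two tokens are found.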

Upvotes: 2
