Reputation: 426
Several numbers do not split up even after using split function
707 K   -7 -7 -6 -8 -2 -5 -8 -8  2 -5 -4 -7 -6  6 -8 -6 -7 -4  8 -6
708 L    0  0 -2 -3 -3  1  3 -3  0 -1 -3  4 -2 -5 -2  0  0 -3 -2  0
709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7
710 P   -1 -2  2  1 -2  1  2 -1  1 -4 -4  2 -3 -4  3  1  0 -1 -3 -4
711 E   -3 -3 -3  1 -6  1  5 -3  2  0 -1 -1 -1 -1 -5 -2 -1 -4  0  0
712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7
713 S   -4 -4  1  1 -5 -2 -1  6 -1 -8 -7 -3 -7 -4 -2 -1 -3 -4 -4 -7
This matrix comes from a text file and is very large, both in the number of lines per file and in the number of such files.
I read the lines in Python and split them up as follows:
m = fp.readline()        # fp is the file object; this runs inside a loop
m = m.split()            # split the line into whitespace-separated elements
m = map(int, m[2:22])    # convert elements 2 to 21 from strings to integers
But this raises an error for rows 709 and 712 (the row numbers are in the leftmost column of the matrix):
Traceback (most recent call last):
File "<pyshell#150>", line 1, in <module>
map(int,m[2:22].split())
ValueError: invalid literal for int() with base 10: '-7-10'
That's because '-7-10' was not split into '-7' and '-10' as expected, due to formatting errors in the files.
So the question is: how can I handle this formatting error so that I can split and process the integers like the other rows of the matrix? This has to work for very long files and many of them, so fixing the formatting by hand is not feasible, although a single file contains fewer than 100 such errors. Please help me. Thank you.
Upvotes: 1
Views: 125
Reputation: 63737
Since the data you are dealing with is in a fixed format, it is better to parse it with the struct module than to use split:
>>> from struct import *
>>> line="712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7"
>>> format_string = "<3s2s2s"+"3s"*20
>>> unpack(format_string,line)
('712', ' C', '  ', ' -7', '-10', ' -9', '-10', ' 12', ' -9', '-10', ' -9', ' -9', ' -8', ' -8', ' -9', ' -8', ' -9', ' -9', ' -4', ' -7', ' -9', ' -9', ' -7')
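The session above is Python 2. In Python 3, struct.unpack needs bytes rather than str, so here is a rough sketch of the same idea that also converts the fields to integers. The helper name parse_row_struct is made up for illustration, and the layout (a 7-character prefix followed by twenty 3-character fields) is an assumption taken from the sample rows, not something the file format guarantees.

```python
import struct

# Assumed layout: 3-char row number, 2-char residue letter, 2 chars of padding,
# then twenty 3-char integer fields (67 bytes in total).
ROW_FORMAT = "<3s2s2s" + "3s" * 20


def parse_row_struct(line):
    """Return the 20 integers from one fixed-width row (hypothetical helper)."""
    raw = line.rstrip("\n").encode("ascii")   # struct.unpack works on bytes
    fields = struct.unpack(ROW_FORMAT, raw)   # raises struct.error on a wrong length
    # Skip the three prefix fields; int() ignores the padding spaces.
    return [int(f.decode("ascii")) for f in fields[3:]]


row = "712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7\n"
print(parse_row_struct(row))
```

Note that unpack insists on the exact record length, so rows with trailing spaces or a wider row number would need a different format string.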
Upvotes: 0
Reputation: 180927
You could (as Simeon points out in the comments) just parse the numbers by fixed position. The fields in your sample seem to start at position 7 and each be 3 characters long, so this comprehension loops from 7 to 67 in steps of 3 and converts each 3-character substring to an integer:
>>> m
'709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7\n'
>>> a = [int(m[pos:pos+3]) for pos in range(7,67,3)]
>>> a
[-7, -10, -9, -10, 12, -9, -10, -9, -9, -8, -8, -9, -8, -9, -9, -5, -7, -9, -4, -7]
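To apply the same slicing to a whole file, something along these lines might work. It is only a sketch: PREFIX_WIDTH, FIELD_WIDTH, NUM_FIELDS and the two function names are illustrative, the widths are guessed from the sample rows, and io.StringIO stands in for the real file object.

```python
import io

PREFIX_WIDTH = 7   # assumed width of the "709 V  " prefix
FIELD_WIDTH = 3    # assumed width of each numeric column
NUM_FIELDS = 20    # assumed number of numeric columns per row


def parse_fixed_width(line):
    """Slice one row into NUM_FIELDS integers by column position."""
    end = PREFIX_WIDTH + FIELD_WIDTH * NUM_FIELDS
    return [int(line[pos:pos + FIELD_WIDTH])
            for pos in range(PREFIX_WIDTH, end, FIELD_WIDTH)]


def read_matrix(fp):
    """Parse every non-blank line of an open text file into a list of rows."""
    return [parse_fixed_width(line) for line in fp if line.strip()]


# Two sample rows stand in for a real file here.
sample = io.StringIO(
    "708 L    0  0 -2 -3 -3  1  3 -3  0 -1 -3  4 -2 -5 -2  0  0 -3 -2  0\n"
    "709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7\n"
)
matrix = read_matrix(sample)
print(matrix[1])
```

With a real file you would pass the object returned by open() instead of the StringIO. A row that is shorter than expected makes int() raise ValueError on an empty or blank slice, which at least flags the bad line.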
Upvotes: 2
Reputation: 5919
You can use the re module to find all substrings matching a specific regex pattern:
>>> x = "-7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7"
>>> import re
>>> re.findall( r"-?[0-9]+", x )
['-7', '-10', '-9', '-10', '12', '-9', '-10', '-9', '-9', '-8', '-8', '-9', '-8', '-9', '-9', '-4', '-7', '-9', '-9', '-7']
Of course, if you might get 7, 123 formatted as 7123, then your only option is to split the string by indices rather than content patterns.
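For a full row you would want to drop the row number and letter first, otherwise the pattern picks up the 709 as well. A small Python 3 sketch of that, where parse_row_regex is a hypothetical helper that assumes every row starts with exactly two label tokens:

```python
import re

# An optional minus sign followed by one or more digits.
SIGNED_INT = re.compile(r"-?[0-9]+")


def parse_row_regex(line):
    """Return the integers after the first two tokens (row number and letter)."""
    numeric_part = line.split(None, 2)[2]   # everything after the two labels
    return [int(tok) for tok in SIGNED_INT.findall(numeric_part)]


print(parse_row_regex("709 V   -7-10 -9-10 12\n"))

# Two positive values that ran together cannot be recovered by the pattern:
print(SIGNED_INT.findall("7123"))
```

The second call shows the limitation mentioned above: the pattern sees a single number, 7123, and has no way to know it was meant to be two.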
Upvotes: 0
Reputation: 76194
You could perform a replace on all negative signs to ensure that they're preceded by at least one space:
m = "-7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7"
print m.replace("-", " -").split()
Result:
['-7', '-10', '-9', '-10', '12', '-9', '-10', '-9', '-9', '-8', '-8', '-9', '-8', '-9', '-9', '-5', '-7', '-9', '-4', '-7']
Of course, this only helps if it's a negative number that's butting up against its neighbor. If you have colliding values like:
707 K -1 -2 -3
707 K -4123 -6
Then you can't easily separate the -4 from the 123.
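One way to at least notice such rows is to count the values after splitting. The following Python 3 sketch does that; parse_with_replace and its expected parameter are illustrative names, and the default of 20 columns is an assumption based on the sample matrix.

```python
def parse_with_replace(line, expected=20):
    """Split a row after padding every minus sign, then check the value count."""
    parts = line.replace("-", " -").split()
    values = [int(p) for p in parts[2:]]   # skip the row number and letter
    if len(values) != expected:
        # Most likely two values collided without a minus sign between them.
        raise ValueError("expected %d values, got %d: %r"
                         % (expected, len(values), line))
    return values


row = "709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7"
print(parse_with_replace(row))

try:
    parse_with_replace("707 K -4123 -6", expected=3)
except ValueError as exc:
    print("ambiguous row:", exc)
```

The count check does not repair the collided row, it only reports it, so the handful of affected lines can be dealt with separately.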
Upvotes: 2