Reputation: 101
I'm very new to Python. I've searched extensively for a solution to my problem, but I'm hitting dead ends left and right.
I've produced a series of arrays using the following code:
fh = open(short_seq, 'r')
line_counter = 0
pos = [0]
array = [0.0 for x in range(101)]
for line in fh:
    line_counter += 1.0
    for i in line:
        score = ord(i) - 33.0
        array[pos] += score
        pos += 1
After printing inside the loop I get a large series of arrays.
[1,2,3,4.....]
[2,3,4,5,6.....]
[3,4,5,6,7,8.....100]
...
I'd like to use NumPy to run stats on each column, in the specific alignment that they are printed out in, but once I'm outside of the loop I can only get the sum of the entire loop. I tried np.concatenate, but that still left me with the sum of the arrays. If I use NumPy in the loop then I can only run stats on each column one iteration at a time, rather than on the whole series. My next idea was to add each iteration into a two-dimensional matrix, but I couldn't figure out how to keep the alignment.
Any help would be greatly appreciated.
EDIT: Here is a sample of my data (the four strings sit directly underneath one another in a text editor). I'm trying to convert a few thousand lines of ASCII to numerical values. Each line has to be in an array 100 characters long, and then I need to run stats on each column.
CCCFFFFFHHHHHIJJJJJJIJJJJJJJJIJJJIJJJJJJJIJJIJJGIIIHIIIFGIGFHFGIIIHIHHGEHHFDFFFFFDDDDDBDDDDDDDDEDEEDD
CCCFFFFFHHHHHJJJJJJJJJJIIIJJIGJJJJJJJJJJIJJJJJIJJJJJJIJIJJIJJIJJIJJHGHHHHFFCEFFFEEDAEEEFEEDDDB:ADDDD:
CCCFFFFFHHHHHJIJJJIJJJIJJIJJIIJIIJJJJJJJJJJJJJIIJJJJJJJJJGHHHHFFFFFFEEEEEEEDDDDDEDDDDDDDDDDDDDDDDD>9<
BCCFFFDFHHHHHJJJJJJJJJJJIIJJJI@HGIIIJJJJJIJJIJIIJJJJJJJJJHHHHHHFFFDDDDDDDDDDDDDDDD?BDDDD@CDDDDDBDDDDD
Upvotes: 1
Views: 1434
Reputation: 231540
array = [0.0 for x in range(101)]
is a list, whereas
array = np.zeros((101,), float)
is a NumPy array of the same size.
With for line in fh:
you get a line, a string. I expect for i in line:
to iterate over the characters in that string. Is that really what you want?
for i in line:
    score = ord(i) - 33.0
    array[pos] += score
    pos += 1
Usually when people read a text file they want the values of columns separated by spaces or commas, e.g.
123, 345, 344, 233
343, 342, 343, 343
We use line.split(',') to split such a string into substrings, and float or int to turn those into numbers, e.g.
data = [float(substring) for substring in line.split(',')]
Show us some of your data file, or a simplified version. It will be easier to help. A key question is whether the number of 'columns' is consistent across lines.
Often when we iterate over the lines of a file, we collect the values from each line in a list. If the number of elements in the sublists is consistent, we can turn the list of lists into a 2d array.
lines = []
for line in fh:
    data = [float(i) for i in line.split(',')]
    lines.append(data)
print(lines)
# A = np.array(lines)
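As a rough, self-contained sketch of that pattern, the snippet below parses the two comma-separated sample rows shown above. The io.StringIO object is only a stand-in for an open file handle, and the name rows is chosen for illustration.

```python
import io
import numpy as np

# Stand-in for an open file; the two rows are the sample values shown above.
fh = io.StringIO("123, 345, 344, 233\n343, 342, 343, 343\n")

rows = []
for line in fh:
    # float() tolerates the surrounding spaces and the trailing newline
    rows.append([float(s) for s in line.split(',')])

A = np.array(rows)       # a proper 2d array because every row has 4 values
print(A.shape)           # (2, 4)
print(A.mean(axis=0))    # per-column mean
```

If the rows had different lengths, np.array would not produce a clean 2d numeric array, which is why the consistency question matters.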
===============================
With your sample lines I can do:
In [258]: with open('stack38175089.txt') as f:
   .....:     lines=f.readlines()
   .....:
In [259]: [len(l) for l in lines]
Out[259]: [102, 102, 102, 102]
In [260]: data=np.array([[ord(i) for i in l.strip()] for l in lines])
In [261]: data.shape
Out[261]: (4, 101)
In [262]: data
Out[262]:
array([[67, 67, 67, 70, 70, 70, 70, 70, 72, 72, 72, 72, 72, 73, 74, 74, 74,
74, 74, 74, 73, 74, 74, 74, 74, 74, 74, 74, 74, 73, 74, 74, 74, 73,
74, 74, 74, 74, 74, 74, 74, 73, 74, 74, 73, 74, 74, 71, 73, 73, 73,
72, 73, 73, 73, 70, 71, 73, 71, 70, 72, 70, 71, 73, 73, 73, 72, 73,
72, 72, 71, 69, 72, 72, 70, 68, 70, 70, 70, 70, 70, 68, 68, 68, 68,
68, 66, 68, 68, 68, 68, 68, 68, 68, 68, 69, 68, 69, 69, 68, 68],
...
[66, 67, 67, 70, 70, 70, 68, 70, 72, 72, 72, 72, 72, 74, 74, 74, 74,
74, 74, 74, 74, 74, 74, 74, 73, 73, 74, 74, 74, 73, 64, 72, 71, 73,
73, 73, 74, 74, 74, 74, 74, 73, 74, 74, 73, 74, 73, 73, 74, 74, 74,
74, 74, 74, 74, 74, 74, 72, 72, 72, 72, 72, 72, 70, 70, 70, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 63, 66, 68,
68, 68, 68, 64, 67, 68, 68, 68, 68, 68, 66, 68, 68, 68, 68, 68]])
With a 2d array like this I can easily shift the values (-33), and apply statistical calculations over rows or columns.
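A minimal sketch of what that could look like, using three short made-up strings in place of the real 101-character lines (the strings and variable names here are illustrative assumptions, not taken from the data file):

```python
import numpy as np

# Hypothetical short quality strings; real lines would be 101 characters each.
lines = ["CCCF\n", "BCCF\n", "CCC@\n"]

data = np.array([[ord(c) for c in l.strip()] for l in lines])
scores = data - 33                # shift every value in one step

col_mean = scores.mean(axis=0)    # one value per column (character position)
col_std = scores.std(axis=0)      # spread within each column
row_mean = scores.mean(axis=1)    # one value per row (line of the file)
print(col_mean, col_std, row_mean)
```

axis=0 reduces down the rows and so gives one statistic per column, which is the per-position alignment the question asks for; axis=1 gives one statistic per line instead.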
I could have read the lines individually and collected the values in a list of lists. But this sample, and I suspect your whole file, is small enough to use readlines.
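For a larger file, the line-by-line alternative might look roughly like this. It is only a sketch: io.StringIO stands in for the real file handle, EXPECTED_LEN is a hypothetical constant (4 for this toy input, 101 for the sample lines in the question), and skipping ragged lines is just one possible policy.

```python
import io
import numpy as np

EXPECTED_LEN = 4   # assumption for this toy input; the question's lines have 101 characters

# io.StringIO stands in for the open file handle.
fh = io.StringIO("CCCF\nBCCF\nCC\nCCC@\n")

rows = []
for line in fh:
    chars = line.strip()
    if len(chars) != EXPECTED_LEN:
        continue                       # skip ragged lines so columns stay aligned
    rows.append([ord(c) - 33 for c in chars])

scores = np.array(rows)                # shape: (number of kept lines, EXPECTED_LEN)
print(scores.shape)
```

Checking the length before appending keeps every sublist the same size, so the final np.array call yields a 2d array whose columns line up by character position.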
Upvotes: 1