Reputation: 3782
Main Problem:
NumPy arrays of the same type and size are not being stacked column-wise by np.hstack, np.column_stack, or np.concatenate(axis=1).
Explanation:
I don't understand what property of a NumPy array could make numpy.hstack, numpy.column_stack, and numpy.concatenate(axis=1) misbehave. My real program will not stack by column; it only appends rows. Is there some property of a NumPy array that would cause this? No error is thrown; the result just isn't the "right" or "normal" behavior.
I have tried a simple case which works as I would expect it to:
input:
a = np.array([['1', '2'], ['3', '4']], dtype=object)
b = np.array([['5', '6'], ['7', '8']], dtype=object)
np.hstack((a, b))
output:
np.array([['1', '2', '5', '6'], ['3', '4', '7', '8']], dtype=object)
That's perfectly fine by me, and what I want.
However, what I get from my program is this:
First array:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '-0.015765'] ['908.073', '-0.0154842'] []]
Second array (to be added on in columns):
[['29.8989', '26.8556'] ['29.8659', '26.7969'] ['29.902', '29.0183'] ...,
['908.791', '943.621'] ['908.073', '940.529'] []]
What should be the two arrays side by side or in columns:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '943.621'] ['908.073', '940.529'] []]
Clearly, this isn't the right answer.
The module creating this problem is rather long (I will give it at the bottom), but here is a simplification of it which still works (performs the right column stacking) like the first example:
import numpy as np

def contiguous_regions(condition):
    d = np.diff(condition)
    idx, = d.nonzero()
    idx += 1
    if condition[0]:
        idx = np.r_[0, idx]
    if condition[-1]:
        idx = np.r_[idx, condition.size]
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

total_array = np.array([['1', '2'], ['3', '4'], ['strings','here'], ['5', '6'], ['7', '8']], dtype=object)
where_number = np.array(map(is_number, total_array))
contig_ixs = contiguous_regions(where_number)
print contig_ixs
t = tuple(total_array[s[0]:s[1]] for s in contig_ixs)
print t
print np.hstack(t)
It basically looks through an array of lists and finds the longest set of continuous numbers. I would like to column stack those sets of data if they are of the same length.
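For readers on Python 3, the same idea can be sketched as follows. This is only an illustration, not the code from the post: it swaps the np.diff approach for a padding-and-compare variant, and the helper `row_is_numeric` and the names `numeric_rows`, `regions`, `blocks`, and `stacked` are assumptions introduced here.

```python
import numpy as np

def contiguous_regions(condition):
    """Return an (n, 2) array of [start, stop) index pairs for runs of True."""
    condition = np.asarray(condition, dtype=bool)
    # Pad with False on both sides so every run has both a start and an end edge.
    padded = np.concatenate(([False], condition, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return edges.reshape(-1, 2)

def row_is_numeric(row):
    """Hypothetical helper: True if every item in the row parses as a float."""
    try:
        for item in row:
            float(item)
        return True
    except ValueError:
        return False

total_array = np.array([['1', '2'], ['3', '4'], ['strings', 'here'],
                        ['5', '6'], ['7', '8']], dtype=object)

# One boolean per row: is the whole row numeric?
numeric_rows = np.array([row_is_numeric(row) for row in total_array])
regions = contiguous_regions(numeric_rows)

# Slice out each numeric run and place the runs side by side.
blocks = tuple(total_array[start:stop] for start, stop in regions)
stacked = np.hstack(blocks)

print(regions)
print(stacked)
```

Because every row of `total_array` has two items, it is a genuine 2D array, so each slice is 2D as well and np.hstack joins the slices by column.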
Here is the real module providing the problem:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = np.array(map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, file_data))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)

    try:
        # Data lengths (number of rows) for each set of data in the file
        data_lengths = contig[:,1] - contig[:,0]
        # Get the maximum length of data (max number of contiguous rows) in the file
        maxs = np.amax(data_lengths)
        # Find the indices for where this long list of data is (index within the indices array of the file)
        # If there are two equally long lists of data, get both indices
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

    ###############################################################################################
    ###############################################################################################
    # PROBLEM ORIGINATES HERE

    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]

    # The file data with this longest contiguous chain of numbers
    # If there are multiple sets of data of the same length, they are added in columns
    longest_data_chains = tuple([file_data[i[0]:i[1]] for i in ss])
    print "First array:"
    print longest_data_chains[0]
    print
    print "Second array (to be added on in columns):"
    print longest_data_chains[1]
    column_stacked_data_chain = np.concatenate(longest_data_chains, axis=1)
    print
    print "What should be the two arrays side by side or in columns:"
    print column_stacked_data_chain
    ###############################################################################################
    ###############################################################################################

    # Remove empty lists, make into numpy array
    # (this must run after column_stacked_data_chain has been assigned)
    xy_array = np.array(filter(None, column_stacked_data_chain))
    xy = np.array(zip(*xy_array), dtype=float)
    return xy
#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit
    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
UPDATE:
I got it to work with the help of @hpaulj. Structuring the data like np.array([['1','2'],['3','4']]) in both cases was not sufficient: in the real case the array had dtype=object and some of the lists contained strings, so NumPy was seeing a 1D array instead of the 2D array that column stacking requires.
The fix was to apply map(float, data) to every list produced from the readlines output.
Here is what I ended up with:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data)

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, xy_array))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)

    try:
        # Data lengths
        data_lengths = contig[:,1] - contig[:,0]
        # All maximums in contiguous data
        maxs = np.amax(data_lengths)
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]
    print ss

    # The file data with this longest contiguous chain of numbers
    # Float must be cast to each value in the lists of the contiguous data and cast to a numpy array
    longest_data_chains = np.array([[map(float, n) for n in xy_array[i[0]:i[1]]] for i in ss])

    # If there are multiple sets of data of the same length, they are added in columns
    column_stacked_data_chain = np.hstack(longest_data_chains)
    xy = np.array(zip(*column_stacked_data_chain), dtype=float)
    return xy
#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit
    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
This function now takes a file and returns the longest contiguous block of numeric data found in it. If several blocks share that maximum length, it column stacks them.
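A compact Python 3 sketch of the same pipeline might look like the following. It is not the code above: it works on a list of lines rather than a file path, uses a plain loop instead of contiguous_regions, does not transpose the result, and assumes every row within a numeric block has the same number of columns. The names `row_is_numeric`, `longest_numeric_blocks`, `lines`, and `result` are made up for this illustration.

```python
import numpy as np

def row_is_numeric(row):
    """Hypothetical helper: True if every token in the row parses as a float."""
    try:
        for token in row:
            float(token)
        return True
    except ValueError:
        return False

def longest_numeric_blocks(lines):
    """Column-stack the longest runs of consecutive numeric rows."""
    # Treat commas as whitespace, split each line, and drop empty rows up front
    # so that no ragged (empty) row can end up inside a block.
    rows = [line.replace(',', ' ').split() for line in lines]
    rows = [row for row in rows if row]
    numeric = [row_is_numeric(row) for row in rows]

    # Collect [start, stop) pairs for each run of consecutive numeric rows.
    # The extra False at the end closes a run that reaches the last row.
    runs, run_start = [], None
    for i, flag in enumerate(numeric + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            runs.append((run_start, i))
            run_start = None
    if not runs:
        return np.array([])

    longest = max(b - a for a, b in runs)
    # Converting to float here yields real 2-D float arrays,
    # which np.hstack can join by column.
    blocks = [np.array([[float(t) for t in row] for row in rows[a:b]])
              for a, b in runs if b - a == longest]
    return np.hstack(blocks)

lines = ["time value\n", "1, 2\n", "3, 4\n", "second set\n",
         "5, 6\n", "7, 8\n", "\n"]
result = longest_numeric_blocks(lines)
print(result)
```

Dropping the empty rows before building any array is the step that keeps each block two-dimensional.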
Upvotes: 2
Views: 3135
Reputation: 67427
It's the empty list at the end of your arrays that's causing your problem:
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[1, 2], [3, 4], []])
>>> a.shape
(2L, 2L)
>>> a.dtype
dtype('int32')
>>> b.shape
(3L,)
>>> b.dtype
dtype('O')
Because of that empty list at the end, NumPy creates a 1D array of dtype object instead of a 2D array, with each item holding a Python list (two items long, except for the empty one).
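The effect on stacking can be sketched in Python 3 as below. This assumes a recent NumPy, where a ragged list has to be given dtype=object explicitly; the variable names are made up for the illustration.

```python
import numpy as np

# Rows of equal length give a genuine 2-D array.
regular = np.array([['1', '2'], ['3', '4']], dtype=object)

# A trailing empty list makes the rows ragged, so the result is a
# 1-D object array whose elements are Python lists.
ragged = np.array([['1', '2'], ['3', '4'], []], dtype=object)

# For 1-D inputs, hstack joins end to end, which looks like "appending rows".
stacked_1d = np.hstack((ragged, ragged))

# Dropping the empty list restores a 2-D array, and hstack then adds columns.
cleaned = np.array([row for row in ragged if row], dtype=object)
stacked_2d = np.hstack((cleaned, cleaned))

print(regular.shape, ragged.shape)
print(stacked_1d.shape, stacked_2d.shape)
```

Checking .shape and .dtype on the inputs before stacking is a quick way to spot this situation.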
Upvotes: 1