Reputation: 31
I have a 2D NumPy array whose elements are lists of tokenized words. I want to pad those lists with keras.preprocessing.sequence.pad_sequences.
The array's first dimension corresponds to dates. For every date, I have 25 lists of tokenized words (the second dimension), and these are the lists I want to pad.
A sample of my array:
>>> tokenized_news_seq_trunc[0]
array([list([915, 3691, 53, 48, 3692, 361, 579, 2432, 20]),
       list([453, 2433, 309, 1094, 133, 3, 228, 2433, 133, 3, 145, 133, 113]),
       list([2434, 3693, 251, 10, 16, 3694, 1731, 3695, 229, 1353, 580]),
       ..., list([865, 913, 555, 17, 8086]),
       list([3057, 1237, 121, 8087, 811, 2233, 497, 8088, 1, 8089, 8090, 44, 199, 8, 1771, 1072, 8091, 24, 72, 1280]),
       list([8092, 10, 16, 63, 151, 76, 622, 980, 1758, 3690, 174, 207, 840, 3279, 8093, 8094, 8095, 12, 1650, 735, 8096])],
      dtype=object)
I have tried:
for i in range(tokenized_news_seq_trunc.shape[0]):
    for j in range(tokenized_news_seq_trunc.shape[1]):
        #print(tokenized_news_seq_trunc[i][j])
        tokenized_news_seq_trunc[i][j] = pad_sequences(tokenized_news_seq_trunc[i][j], maxlen=MAX_LEN)
but I get an error: ValueError: sequences must be a list of iterables. Found non-iterable: 915.
It looks like pad_sequences iterates over the single list I pass in and expects each element to be a sequence itself, so it fails on the first integer (915).
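To illustrate the error above, here is a minimal sketch using a hypothetical 2 x 3 miniature of the array (the names `data`, `rows`, `cell` and `row` are made up for this example). It does not call Keras; it only inspects what each level of indexing yields.

```python
import numpy as np

# Hypothetical miniature: 2 dates x 3 headlines, each cell a list of token ids.
rows = [
    [[915, 3691, 53], [453, 2433], [2434]],
    [[865, 913], [3057], [8092, 10, 16, 63]],
]
data = np.empty((2, 3), dtype=object)
for i, row_lists in enumerate(rows):
    for j, tokens in enumerate(row_lists):
        data[i, j] = tokens  # store the list itself in the object cell

cell = data[0][0]  # one headline: a flat list of ints
row = data[0]      # one date: an object array holding 3 lists

# Iterating over a single cell yields plain integers, which is what a
# function expecting "a list of sequences" trips over.
cell_items_are_ints = all(isinstance(t, int) for t in cell)

# Iterating over a row yields lists, the unit a batch-padding call expects.
row_items_are_lists = all(isinstance(s, list) for s in row)

print(cell_items_are_ints, row_items_are_lists)
```

If a single list really has to be padded on its own, wrapping it as `[cell]` should give pad_sequences the list of sequences it expects, although passing a whole row at once is usually simpler.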
I have also tried:
for i in range(tokenized_news_seq_trunc.shape[0]):
    #print(tokenized_news_seq_trunc[i][j])
    tokenized_news_seq_trunc[i] = pad_sequences(tokenized_news_seq_trunc[i], maxlen=MAX_LEN)
but it raises:
ValueError: could not broadcast input array from shape (1989,27) into shape (1989)
(1989 is the number of dates, 27 is MAX_LEN)
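The broadcast error can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: `pad_pre` is a hypothetical NumPy stand-in that mimics the default behaviour of pad_sequences (zeros added on the left, only the last `maxlen` tokens kept), and `data` is a made-up 2 x 3 miniature of the array. It shows that a padded 2D block cannot be written back into a 1D object row, and that collecting the padded rows into a new array avoids the problem.

```python
import numpy as np

MAX_LEN = 5  # small stand-in for the question's MAX_LEN of 27


def pad_pre(seqs, maxlen):
    """Hypothetical stand-in for pad_sequences with default settings:
    zeros go on the left and only the last `maxlen` tokens are kept."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for k, seq in enumerate(seqs):
        tail = list(seq)[-maxlen:]
        if tail:
            out[k, maxlen - len(tail):] = tail
    return out


# Hypothetical miniature of the array: 2 dates x 3 headlines.
rows = [
    [[915, 3691, 53], [453, 2433], [2434, 3693, 251, 10, 16, 3694]],
    [[865, 913], [3057], [8092, 10, 16, 63]],
]
data = np.empty((2, 3), dtype=object)
for i, row_lists in enumerate(rows):
    for j, tokens in enumerate(row_lists):
        data[i, j] = tokens

# Writing the padded (3, MAX_LEN) block into a row of shape (3,) fails.
try:
    data[0] = pad_pre(data[0], MAX_LEN)
    in_place_ok = True
except ValueError:
    in_place_ok = False

# Building a new array from the padded rows keeps all three dimensions.
padded = np.stack([pad_pre(row, MAX_LEN) for row in data])
print(in_place_ok, padded.shape)
```

With the real data, the same pattern would presumably be `np.stack([pad_sequences(row, maxlen=MAX_LEN) for row in tokenized_news_seq_trunc])`, giving an array of shape (number of dates, 25, MAX_LEN).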
Thanks for your help!
PS: I also have the same data as a list of lists of lists containing my tokenized words, in case there is a better way to do this with plain lists.
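For the list-of-lists-of-lists variant, one possible approach is to pad every inner list in plain Python and convert the result to an array at the end. This is a minimal sketch, not a Keras API: `pad_post` and `news` are hypothetical names, and it assumes every date holds the same number of headlines (25 in the question). Unlike the pad_sequences defaults, it truncates and pads at the end of each list.

```python
import numpy as np

MAX_LEN = 4  # small stand-in for the question's MAX_LEN of 27


def pad_post(tokens, maxlen, value=0):
    """Hypothetical helper: keep the first `maxlen` tokens, then right-pad."""
    kept = list(tokens[:maxlen])
    return kept + [value] * (maxlen - len(kept))


# Hypothetical miniature of the nested lists: 2 dates x 2 headlines.
news = [
    [[915, 3691, 53], [453, 2433, 309, 1094, 133]],
    [[865, 913], [3057]],
]

# Every inner list now has length MAX_LEN, so the nesting is rectangular
# and converts cleanly into a 3D integer array.
padded = np.array(
    [[pad_post(headline, MAX_LEN) for headline in day] for day in news],
    dtype="int32",
)
print(padded.shape)
```

Because the padding happens before the conversion, NumPy never sees ragged data and no object array is involved.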
Upvotes: 2
Views: 1041
Reputation: 31
I found a function for padding nested sequences in the anago documentation on PyPI, but it does not truncate my sentences to MAX_WORDS (27). In the adapted version below, the output dimensions are fixed (25 lists per date, 27 tokens per list) and anything longer is truncated by slicing.

import numpy as np

def pad_nested_sequences(sequences, max_sent_len=25, max_word_len=27, dtype='int32'):
    """Pad and truncate nested sequences.

    This function transforms a list of list sequences
    into a 3D Numpy array of shape `(num_samples, max_sent_len, max_word_len)`.

    Args:
        sequences: List of lists of lists.
        max_sent_len: Number of lists kept per sample (25 per date here).
        max_word_len: Number of tokens kept per list (27 here).
        dtype: Type of the output sequences.

    Returns:
        x: Numpy array, zero-padded at the end of each list.
    """
    x = np.zeros((len(sequences), max_sent_len, max_word_len), dtype=dtype)
    for i, sent in enumerate(sequences):
        # Slicing drops lists beyond max_sent_len and tokens beyond max_word_len.
        for j, word in enumerate(sent[:max_sent_len]):
            word = word[:max_word_len]
            x[i, j, :len(word)] = word
    return x
Upvotes: 1