Reputation: 265
I am working with data that has thousands of rows, but the columns are uneven, as shown below:
AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12
First, I want to read the data into a list or array and then find the length of the longest row.
Then I will pad the shorter rows with zeros to match the longest one, so that I can iterate over them as a 2D array.
I have looked at a couple of similar questions, but could not work out the problem.
I believe there is a way to do this in Python. Could anyone please help me out?
Upvotes: 1
Views: 2121
Reputation: 58461
I don't see any easier way to figure out the maximum row length than to do one pass to find it. Then we build the 2D array in a second pass. Something like:
from __future__ import print_function
import numpy as np
from itertools import chain
data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''
max_row_len = max(len(line.split()) for line in data.splitlines())
def padded_lines():
    for uneven_line in data.splitlines():
        line = uneven_line.split()
        line += ['0'] * (max_row_len - len(line))
        yield line
# I will get back to the line below shortly, it unnecessarily creates the array
# twice in memory:
array = np.array(list(chain.from_iterable(padded_lines())), np.dtype(object))
array.shape = (-1, max_row_len)
print(array)
This prints:
[['AB' '12' '43' '54' '0' '0' '0' '0' '0']
 ['DM' '33' '41' '45' '56' '33' '77' '88' '0']
 ['MO' '88' '55' '66' '32' '34' '0' '0' '0']
 ['KL' '10' '90' '87' '47' '23' '48' '56' '12']]
The above code is inefficient in the sense that it creates the array twice in memory. I will get back to it; I think I can fix that.
However, numpy arrays are supposed to be homogeneous. You want to put strings (the first column) and integers (all the other columns) in the same 2D array. I still think you are on the wrong track here and should rethink the problem and pick another data structure or organize your data differently. I cannot help you with that since I don't know how you want to use the data.
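For example, if the first column is a label and all remaining columns are integers (an assumption on my part), one option is to keep the labels in a plain list and put only the numbers, zero-padded, into a homogeneous integer array:
# A minimal sketch, assuming the first column is a label and the remaining
# columns are integers; `data` is the string defined above.
rows = [line.split() for line in data.splitlines()]
labels = [row[0] for row in rows]
numbers = [[int(x) for x in row[1:]] for row in rows]
width = max(len(nums) for nums in numbers)
values = np.array([nums + [0] * (width - len(nums)) for nums in numbers])
print(labels)   # ['AB', 'DM', 'MO', 'KL']
print(values)   # a 4x8 integer array, short rows padded with zeros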
As promised, here is the solution to the "array created twice" efficiency issue; note that my concern was memory consumption.
def main():
    with open('/tmp/input.txt') as f:
        max_row_len = max(len(line.split()) for line in f)

    with open('/tmp/input.txt') as f:
        str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))

    def padded_lines():
        with open('/tmp/input.txt') as f:
            for uneven_line in f:
                line = uneven_line.split()
                line += ['0'] * (max_row_len - len(line))
                yield line

    fmt = '|S%d' % str_len_max
    array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
    array.shape = (-1, max_row_len)
This code could be made nicer but I will leave that up to you.
The memory consumption, measured with memory_profiler on a randomly generated input file with 1000000 lines and uniformly distributed row lengths between 1 and 20:
Line # Mem usage Increment Line Contents
================================================
5 23.727 MiB 0.000 MiB @profile
6 def main():
7
8 23.727 MiB 0.000 MiB with open('/tmp/input.txt') as f:
9 23.727 MiB 0.000 MiB max_row_len = max(len(line.split()) for line in f)
10
11 23.727 MiB 0.000 MiB with open('/tmp/input.txt') as f:
12 23.727 MiB 0.000 MiB str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))
13
14 23.727 MiB 0.000 MiB def padded_lines():
15 with open('/tmp/input.txt') as f:
16 62.000 MiB 38.273 MiB for uneven_line in f:
17 line = uneven_line.split()
18 line += ['0']*(max_row_len - len(line))
19 yield line
20
21 23.727 MiB -38.273 MiB fmt = '|S%d' % str_len_max
22 array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
23 62.004 MiB 38.277 MiB array.shape = (-1, max_row_len)
With the code in eumiro's answer, and with the same input file:
Line # Mem usage Increment Line Contents
================================================
5 23.719 MiB 0.000 MiB @profile
6 def main():
7 23.719 MiB 0.000 MiB with open('/tmp/input.txt') as f:
8 638.207 MiB 614.488 MiB arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T
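For reference, both profiles were produced with memory_profiler; a minimal sketch of the setup (the script name is an assumption, not the exact file used):
# Hypothetical profiling setup; assumes memory_profiler is installed
# (pip install memory_profiler). The @profile decorator marks the
# function to be measured line by line.
from memory_profiler import profile

@profile
def main():
    pass  # the code shown above goes here

if __name__ == '__main__':
    main()

# Run it with:
#     python -m memory_profiler the_script.py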
Comparing the memory consumption increments: my updated code consumes about 16 times less memory than eumiro's (614.488/38.273 is approximately 16).
As for speed: my updated code runs in 3.321 s on this input, eumiro's in 5.687 s; that is, mine is about 1.7x faster on my machine. (Your mileage may vary.)
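A rough sketch of how such a wall-clock comparison can be made (this is an assumption about the methodology, not a record of the exact setup):
# Hypothetical timing harness: wrap each version in a function and
# measure elapsed wall-clock time.
import time

start = time.time()
main()          # or a function wrapping eumiro's one-liner
print('elapsed: %.3fs' % (time.time() - start))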
If efficiency is your primary concern (as suggested by your comment "Hi eumiro, I suppose this is more efficient." and by the change of the accepted answer), then I am afraid you accepted the less efficient solution.
Don't get me wrong, eumiro's code is really concise, and I certainly learned a lot from it. If efficiency were not my primary concern, I would go with eumiro's solution too.
Upvotes: 2
Reputation: 212885
You can use itertools.izip_longest, which handles finding the longest line for you:
import itertools as it
import numpy as np
with open('filename.txt') as f:
    arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T
arr is now:
array([['a', '1', '2', '0'],
       ['b', '3', '4', '5'],
       ['c', '6', '0', '0']],
      dtype='|S1')
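On Python 3 the function is named itertools.zip_longest; a minimal sketch of the same approach (the file name is just a placeholder):
import itertools as it
import numpy as np

with open('filename.txt') as f:
    # zip_longest(*rows) zips across the rows, padding the shorter ones with
    # fillvalue; the final .T transposes the result back to one row per line.
    arr = np.array(list(it.zip_longest(*[line.split() for line in f], fillvalue='0'))).T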
Upvotes: 1