Jacek

Reputation: 765

Creating index columns with Python

As a minimal working example, I have a file.txt containing a list of numbers:

1.1
2.1
3.1
4.1
5.1
6.1
7.1
8.1

which should actually be presented with indices that make it a 3D array:

0   0   1.1
1   0   2.1
0   1   3.1
1   1   4.1
0   2   5.1
1   2   6.1
0   3   7.1
1   3   8.1

I want to import the 3D array into Python. So far I have been using bash to generate the indices, pasting them next to file.txt, and then importing the resulting full.txt with pandas:

for ((y=0;y<=3;y++)); do
    for ((x=0;x<=1;x++)); do
        echo -e "$x\t$y"
    done
done > index.txt
paste index.txt file.txt > full.txt

Writing index.txt has been slow in my actual code, which has x up to 9000 and y up to 5000. Is there a way to generate the indices directly as the first two columns of a 2D NumPy array, so that I only need to import the data from file.txt as the third column?

Upvotes: 1

Views: 139

Answers (3)

Michael Szczesny

Reputation: 5026

I would recommend using pandas for loading the data and managing columns with different types. We can generate the indices with np.indices for the desired dimensions and reshape them to match your format, then concatenate the result with the values read from 'file.txt'.

Creating the index for (9000,5000) takes about 950ms on a colab instance.

import numpy as np
import pandas as pd

x,y = 2,4 # dimensions, also works with 9000,5000 but assumes 'file.txt' has the correct size

pd.concat([
    pd.DataFrame(np.indices((x,y)).ravel('F').reshape(-1,2), columns=['ind1','ind2']),
    pd.read_csv('file.txt', header=None, names=['Value'])
    ], axis=1)

Out:

   ind1  ind2  Value
0     0     0    1.1
1     1     0    2.1
2     0     1    3.1
3     1     1    4.1
4     0     2    5.1
5     1     2    6.1
6     0     3    7.1
7     1     3    8.1

How this works

First, create the indices for your desired dimensions with np.indices:

np.indices((2,4))

Out:

array([[[0, 0, 0, 0],
        [1, 1, 1, 1]],

       [[0, 1, 2, 3],
        [0, 1, 2, 3]]])

This gives us the right indices, but in the wrong order.
With ravel('F') we can flatten the array in column-major (Fortran) order, so that the first axis varies fastest:

np.indices((2,4)).ravel('F')

Out:

array([0, 0, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 0, 3, 1, 3])

To get the desired columns, reshape into a 2D array with shape (8,2). With (-1,2) the first dimension is inferred.

np.indices((2,4)).ravel('F').reshape(-1,2)

Out:

array([[0, 0],
       [1, 0],
       [0, 1],
       [1, 1],
       [0, 2],
       [1, 2],
       [0, 3],
       [1, 3]])

Then convert into a dataframe with columns ind1 and ind2.
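
If a plain 2D NumPy array is preferred over a dataframe, as the question asks, the same index construction can be stacked next to the values. This is only a minimal sketch: the in-memory `io.StringIO` stands in for 'file.txt', and the names `nx`, `ny`, `values` and `full` are chosen here for illustration.

```python
import io

import numpy as np

nx, ny = 2, 4  # grid dimensions; x varies fastest in the file

# Stand-in for 'file.txt' so that the sketch is self-contained
file_like = io.StringIO("1.1\n2.1\n3.1\n4.1\n5.1\n6.1\n7.1\n8.1\n")
values = np.loadtxt(file_like)

# Guard against a file whose length does not match the grid
if values.size != nx * ny:
    raise ValueError("number of values does not match nx * ny")

# One row per (x, y) pair, with x varying fastest
idx = np.indices((nx, ny)).ravel('F').reshape(-1, 2)

# The integer index columns are upcast to float to share one array with the values
full = np.column_stack([idx, values])
print(full)
```

Because a NumPy array has a single dtype, the indices end up as floats here; the dataframe version keeps them as integers.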


Working with more dimensions

pd.DataFrame(np.indices((2,4,3)).ravel('F').reshape(-1,3)).add_prefix('ind')

Out:

    ind0  ind1  ind2
0      0     0     0
1      1     0     0
2      0     1     0
3      1     1     0
4      0     2     0
5      1     2     0
6      0     3     0
7      1     3     0
8      0     0     1
9      1     0     1
10     0     1     1
11     1     1     1
12     0     2     1
13     1     2     1
14     0     3     1
15     1     3     1
16     0     0     2
17     1     0     2
18     0     1     2
19     1     1     2
20     0     2     2
21     1     2     2
22     0     3     2
23     1     3     2
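
As a possible alternative for any number of dimensions, np.unravel_index with order='F' should produce the same index columns directly from the flat row numbers. The sketch below compares it against the np.indices construction; the variable names are illustrative only.

```python
import numpy as np

shape = (2, 4, 3)
n = int(np.prod(shape))  # total number of rows

# order='F' makes the first index change fastest, as in the tables above
cols = np.unravel_index(np.arange(n), shape, order='F')
idx = np.stack(cols, axis=1)  # one column per dimension

# Reference: the np.indices construction used earlier
ref = np.indices(shape).ravel('F').reshape(-1, len(shape))
print(np.array_equal(idx, ref))
```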

Upvotes: 2

Aaj Kaal

Reputation: 1304

If you want to stick with bash, you can avoid the nested loop:

Code:

for ((y=0;y<=3;y++)); do
    echo -e "0\t$y\n1\t$y"
done

Output:

0       0
1       0
0       1
1       1
0       2
1       2
0       3
1       3

The same in Python:

Code:

for y in range(4):
    print(f'0\t{y}\n1\t{y}')

Output:

0       0
1       0
0       1
1       1
0       2
1       2
0       3
1       3
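
The loop above hardcodes x as 0 and 1. A sketch for arbitrary sizes could use a small generator; `index_lines` is a hypothetical helper name, not something from the question. For the real 9000 x 5000 case this creates 45 million strings, so a vectorised NumPy approach is likely to be faster.

```python
def index_lines(nx, ny):
    """Yield 'x<TAB>y' strings with x varying fastest."""
    for y in range(ny):
        for x in range(nx):
            yield f"{x}\t{y}"


# Small example matching the question's 2 x 4 grid
print("\n".join(index_lines(2, 4)))
```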

Upvotes: 0

FloLie

Reputation: 1841

Here is a quick example of how to create the 3D array from a 1D array. As dummy data I use random numbers. The code then creates tuples of (x, y, value).

It takes about a minute for 45M rows.

from random import randrange

x = 5000
y = 9000

numbers = [randrange(100000,999999) for i in range(x*y)]


# x varies fastest in the file, so the value for (a, b) sits at flat index b*x + a
array = [(a, b, numbers[b*x + a]) for a in range(x) for b in range(y)]

Output

pd.DataFrame(array)
Out[23]: 
             0     1       2
0            0     0  878704
1            0     1  524573
2            0     2  943657
3            0     3  496507
4            0     4  802714
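
A vectorised variant of the same idea, sketched with NumPy and assuming the file order from the question (x varies fastest, so the value for (x, y) sits at flat index y * nx + x). The small sizes and the names `xs`, `ys`, `table` and `grid` are illustrative.

```python
import numpy as np

nx, ny = 2, 4  # small stand-ins for the real sizes
numbers = np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])

xs = np.tile(np.arange(nx), ny)    # 0, 1, 0, 1, ...
ys = np.repeat(np.arange(ny), nx)  # 0, 0, 1, 1, ...

# Row k pairs (xs[k], ys[k]) with numbers[k]
table = np.column_stack([xs, ys, numbers])

# Optional: a 2D grid where grid[x, y] is the value for that index pair
grid = numbers.reshape((nx, ny), order='F')
print(table)
print(grid[1, 2])
```

If only lookups by (x, y) are needed, the reshaped `grid` avoids storing the index columns at all.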

Upvotes: 0
