questionto42
questionto42

Reputation: 9640

How to install a Python package on Linux so that it is found by the already working PostgreSQL 13 plpython3u extension?

Spin-off:

I have tried to catch the idea of “Module not found” when importing a Python package within a plpython3u procedure, I did not know how to use the accepted answer there to make it run on Linux. Normally, this here should be a duplicate, but if there is a difference with with MacOS in the used paths, it might be good for a new question.

Installation of PostgreSQL 13

I am on WSL2 (Ubuntu 20.04). I installed with the commands of the official PostgreSQL Downloads page from the PostgreSQL Apt Repository for Ubuntu.

# Create the file repository configuration:
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

# Import the repository signing key:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

# Update the package lists:
sudo apt-get update

# Install the latest version of PostgreSQL.
# If you want a specific version, use 'postgresql-12' or similar instead of 'postgresql':
sudo apt-get -y install postgresql

This gave me PostgreSQL 13.

Installation of the right plpython3u

I installed plpython3u with commands from PostgreSQL: how to install plpythonu extension:

sudo apt-cache search ".*plpython3.*"
sudo apt-get install postgresql-contrib postgresql-plpython3-13

Because I am on WSL, I start the service with:

service postgresql start

Change role and run psql:

sudo su postgres
psql

PostgreSQL with plpython3u installation and tests without imported packages

Checking the Python version directly as a stored procedure in postgreSQL:

CREATE OR REPLACE FUNCTION return_version()
  RETURNS VARCHAR
AS $$
    import sys
    return sys.version
$$ LANGUAGE plpython3u;

Output:

CREATE FUNCTION

Test:

postgres=# SELECT return_version();
              return_version
------------------------------------------
 3.8.10 (default, Jun  2 2021, 10:49:15) +
 [GCC 9.4.0]
(1 row)

shows that for "PostgreSQL 13", "Python 3.8.10" gets installed.

And this is exactly the one to be found on Linux. If I just enter a Python version that I do not have, like 3.7:

python3.7

I get the overview:

Command 'python3.7' not found, did you mean:

  command 'python3.8' from deb python3.8 (3.8.10-0ubuntu1~20.04)
  command 'python3.9' from deb python3.9 (3.9.5-3~20.04.1)
  command 'python2.7' from deb python2.7 (2.7.18-1~20.04.1)

Confirming that there are no two parallel installations of Python 3.8 installed, it is clear that the Linux 3.8.10 is the used version for plpython3u in PostgreSQL.

plpython3u stored procedure tests with imported packages

I tried a function that imports packages, taken from Kmeans in PostgreSQL in a nutshell:

CREATE OR replace FUNCTION kmeans(input_table text, columns text[], clus_num int) RETURNS bytea AS
 
$$
 
from pandas import DataFrame
from sklearn.cluster import KMeans
from pickle import dumps
 
all_columns = ",".join(columns)
if all_columns == "":
    all_columns = "*"
 
rv = plpy.execute('SELECT %s FROM %s;' % (all_columns, plpy.quote_ident(input_table)))
 
frame = []
 
for i in rv:
    frame.append(i)
df = DataFrame(frame).astype('float')
kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(df._get_numeric_data())
return dumps(kmeans)
 
$$ LANGUAGE plpython3u;

Output:

CREATE FUNCTION

Test throws the error:

postgres=# SELECT kmeans('my_table', ARRAY['col1', 'col2'],3);
ERROR:  ModuleNotFoundError: No module named 'pandas'
CONTEXT:  Traceback (most recent call last):
  PL/Python function "kmeans", line 3, in <module>
    from pandas import DataFrame
PL/Python function "kmeans"

Then, I installed pandas in Python 3.8:

python3.8 -m pip install pandas
Collecting pandas
  Downloading pandas-1.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
     |████████████████████████████████| 11.5 MB 6.0 MB/s
Collecting python-dateutil>=2.7.3
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
     |████████████████████████████████| 247 kB 4.7 MB/s
Collecting numpy>=1.17.3
  Downloading numpy-1.21.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)
     |████████████████████████████████| 15.8 MB 131 kB/s
Collecting pytz>=2017.3
  Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB)
     |████████████████████████████████| 510 kB 5.7 MB/s
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas) (1.14.0)
Installing collected packages: python-dateutil, numpy, pytz, pandas
  WARNING: The scripts f2py, f2py3 and f2py3.8 are installed in '/home/my_user/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed numpy-1.21.2 pandas-1.3.2 python-dateutil-2.8.2 pytz-2021.1

And pandas is now stored at:

python3.8 -m pip show pandas
Name: pandas
Version: 1.3.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /home/my_user/.local/lib/python3.8/site-packages
Requires: numpy, python-dateutil, pytz
Required-by:

But the test still throws the same error:

postgres=# SELECT kmeans('my_table', ARRAY['col1', 'col2'],3);
ERROR:  ModuleNotFoundError: No module named 'pandas'
CONTEXT:  Traceback (most recent call last):
  PL/Python function "kmeans", line 3, in <module>
    from pandas import DataFrame
PL/Python function "kmeans"

Question

How to install pandas (or any other package I need) so that it is found by the PostgreSQL 13 plpython3u extension on Linux?

PS: probably not needed side notes

Probably not needed: Use "which" in a plpython3u stored procedure to check Python

The spun-off question had another version test in it, but that was not showing anything from inside the extension, the commands are just the same as in the normal bash. Therefore, the value of the following is reduced to just a random test of the a plpython3u function. I add this only because you should never be 100 % sure. To me, it is clear that this does not say anything about the Python version that is actually used by plpython3u.

CREATE OR REPLACE FUNCTION get_py()
  RETURNS VARCHAR
AS $$
    import os
    return os.popen('which python3').read()
$$ LANGUAGE plpython3u;

Output:

CREATE FUNCTION

Test:

SELECT get_py();
      get_py
------------------
 /usr/bin/python3+

(1 row)

And the same with 'which python3.8':

select get_py();
       get_py
--------------------
 /usr/bin/python3.8+

(1 row)

Probably not needed: use make and a default Python version to install

Perhaps the default version of Python must be changed somewhere before installation? This is just a very vague guess from an installation using make, with Python settings in /etc/make.conf, here is the code that is used there (question from 2018):

Changing /etc/make.conf to

DEFAULT_VERSIONS+= python=3.6

It is said that this ran through and installed plpython3u (I have not tested this):

root@db% cd /usr/ports/databases/posgresql10-plpython
root@db% make fetch
root@db% make extract
root@db% cd /usr/ports/databases/postgresql10-plpython/work/postgresql-10.5
root@db% ./configure PYTHON=/usr/local/bin/python3 --with-python --with-libraries=/usr/local/lib --with-includes=/usr/local/include/
root@db% cd /usr/ports/databases/posgresql10-plpython
root@db% make install

Upvotes: 2

Views: 2031

Answers (1)

Adrian Klaver
Adrian Klaver

Reputation: 19742

The issue is that this:

python3.8 -m pip install pandas

installs a package(pandas in this case) to the site-packages in the home directory of the user running the command, so:

/home/my_user/.local/lib/python3.8/site-packages

The plpython3u extension running in Postgres is looking for the package in the system wide site-packages. To get the package there you need to do:

sudo python3.8 -m pip install pandas

Upvotes: 3

Related Questions