PostgreSQL/Pandas insert sparse array containing null/np.nan/empty values

Question

I want to insert an array that contains empty values created from Pandas in Python, and these empty values default to np.nan in Pandas dataframe. I don't want them to be 'NaN' in my PostgreSQL database, I want my PostgreSQL arrays to contain empty values like this: '{123,24,,23}' so they are not counted in my aggregate functions like calculating the mean or standard deviation across indices. I am not sure if it is possible to have sparse arrays in PostgreSQL. There won't be a lot of sparse arrays in my dataset, I am just testing this for edge case purposes.

My table schema:

create_table = '''
            CREATE TABLE {t} (
                patient_id VARCHAR[20] PRIMARY KEY,
                gene_expression double precision []
            );
        '''

The relevant Python code (I don't know how to write the proper SQL code here). Here I converted the array into a string, because Python arrays cannot be sparse:

df = df.fillna('')
NCI = [1]
MCI = [2,3]
AD = [4,5]
other = [6]

insert_sql = '''
                INSERT INTO {t} (patient_id, gene_expression)
                VALUES (%s,%s);
            '''
cur = psql_conn.cursor()

for index, row in df.iterrows():
    arr = row[2:].tolist()
    postgres_arr = ','.join(map(str, arr))
    if row['DIAGNOSIS'].isdigit():
        if int(row['DIAGNOSIS']) in NCI:
            cur.execute(insert_sql.format(t='nci'), (row['PATIENT_ID'], postgres_arr,))

        elif int(row['DIAGNOSIS']) in MCI:
            cur.execute(insert_sql.format(t='mci'), (row['PATIENT_ID'], postgres_arr,))

        elif int(row['DIAGNOSIS']) in AD:
            cur.execute(insert_sql.format(t='ad'), (row['PATIENT_ID'], postgres_arr,))

        elif int(row['DIAGNOSIS']) in other:
            cur.execute(insert_sql.format(t='other'), (row['PATIENT_ID'], postgres_arr,))

    elif row['DIAGNOSIS'] == '':
        cur.execute(insert_sql.format(t='na'), (row['PATIENT_ID'], postgres_arr,))

    else:
        print('ERROR: unknown diagnosis {d}.'.format(d=diagnosis))

psql_conn.commit()
cur.close()

My Error:

psycopg2.DataError: malformed array literal: "{2.0,2.4,}"
LINE 3:                     VALUES ('X100_120417','{2.0,2.4,}');
                                                  ^
DETAIL:  Unexpected "}" character.

Dobob · Accepted Answer

After a few hours of trial and error:

Load this Pandas dataframe df from some CSV file:

+----+-------+--------------+
| id | stuff |    array     |
+----+-------+--------------+
|  0 | a     | {1,2,3}      |
|  1 | b     | {1,np.nan,3} |
|  2 | 45    | {np.nan,4,2} |
+----+-------+--------------+

process in pandas using:

df = df.fillna('NULL')
insert_sql = '''
                INSERT INTO {t} (patient_id, gene_expression)
                VALUES (%s,%s);
            '''

for index, row in df.iterrows():
    arr = row[2:].tolist()
    postgres_arr = '{' + ','.join(map(str,arr)) + '}'
    cur.execute(insert_sql.format(t='my_table'), (row['id'], postgres_arr,))

My main issue was recognizing that string literal 'NULL' automatically translate to PostgreSQL NULL keyword, which is ignored in calculations and results of aggregate functions return a value as if the NULL values are not there, versus NaN keyword where every operation with it results in NaN.

PostgreSQL/Pandas insert sparse array containing null/np.nan/empty values

Answers (2)

Related Questions