Reputation: 96
I'm trying to transform a point in a 3D world rendered with pyrender to pixel coordinates. The world-to-camera-frame transformation seems to work; however, the camera-to-pixel-frame transformation is incorrect and I can't figure out what I'm doing wrong. I appreciate any hints!
The goal is to get the pixel coordinates uvw of the world-point UVW. Currently, I do the following:
I create a camera from an already existing intrinsic matrix (= K). I do this mainly for debugging purposes, so I can be sure that K is correct:
K = np.array([[415.69219382, 0.,           320.],
              [0.,           415.69219382, 240.],
              [0.,           0.,           1.  ]])
K = np.ascontiguousarray(K, dtype=np.float32)
p_cam = pyrender.camera.IntrinsicsCamera(fx=K[0][0], fy=K[1][1], cx=K[0][2], cy=K[1][2])
scene.add(p_cam, pose=cam_pose.get_transformation_matrix(x=6170., y=4210., z=60., yaw=20, pitch=0, roll=40)) # cam_pose is my own class
I'm creating a transformation matrix with an extrinsic rotation.
def get_transformation_matrix(self, x, y, z, yaw, pitch, roll):
    from scipy.spatial.transform import Rotation as R
    '''
    yaw = rotate around z axis
    pitch = rotate around y axis
    roll = rotate around x axis
    '''
    xyz = np.array([
        [x],
        [y],
        [z]
    ])
    rot = R.from_euler('zyx', [yaw, pitch, roll], degrees=True).as_matrix()
    last_row = np.array([[0, 0, 0, 1]])
    tf_m = np.concatenate((np.concatenate((rot, xyz), axis=1), last_row), axis=0)
    return np.ascontiguousarray(tf_m, dtype=np.float32)
Using the created camera, I render the following image. The point I'm trying to transform is the tip of the roof, which approximately has the pixel coordinates (500,160). I marked it in the 3D scene with the pink cylinder.
from icecream import ic
K = np.concatenate((K, [[0],[0],[0]]), axis = 1)
UVW1 = [[6184],[4245],[38],[1]] #the homogeneous coordinates of the pink cylinder in the world frame
world_to_camera = np.linalg.inv(cam_pose.transformation_matrix).astype('float32') @ UVW1
ic(world_to_camera)
camera_to_pixel = K @ world_to_camera
ic(camera_to_pixel/camera_to_pixel[2]) #Transforming the homogeneous coordinates back
Output:
ic| world_to_camera: array([[ 17.48892188],
[ 7.11796755],
[-39.35071968],
[ 1. ]])
ic| camera_to_pixel/camera_to_pixel[2]: array([[135.25094424],
[164.80738424],
[ 1. ]])
To me, the world_to_camera pose seems like it might be correct (I might be wrong). However, when transforming from the camera frame to the pixel frame, the x-coordinate (135) is wrong (the y-coordinate (164) might still make sense).
Attached a screenshot of the 3D scene. The yellow cylinder+axes represent the camera, while the blue point represents the point I'm trying to transform (earlier pink in the rendered image).
So to me, the only remaining source of error could be the intrinsic matrix; however, I'm defining this matrix myself, so I don't see how it could be incorrect. Is there something I'm blind to?
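For reference, K itself can be sanity-checked in isolation (a minimal sketch, reusing the padded 3x4 K from the snippet above): any point on the optical axis must project exactly to the principal point (cx, cy) = (320, 240).
# Minimal sanity check of K (the padded 3x4 version from above):
# a point on the optical axis must land on the principal point.
pt = np.array([[0.], [0.], [10.], [1.]])  # 10 units straight ahead (+z, OpenCV convention)
proj = K @ pt
print((proj / proj[2])[:2])               # expect [[320.], [240.]] == (cx, cy)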
Upvotes: 1
Views: 394
Reputation: 1
Late answer, but just in case: your code is actually correct, but you are using the view camera (GL representation), which is required for rendering, instead of the actual world_to_camera (OpenCV representation). So change this
world_to_camera = np.linalg.inv(cam_pose.transformation_matrix).astype('float32')
to:
world_to_camera = (cam_pose.transformation_matrix).astype('float32')
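Plugged back into the question's snippet, the corrected projection would look like this (K is the original 3x3 intrinsics, UVW1 and cam_pose exactly as defined in the question, so this is a sketch under those assumptions):
# Sketch of the corrected projection; K (3x3), UVW1 and cam_pose are
# assumed to be exactly as defined in the question.
K34 = np.concatenate((K, [[0], [0], [0]]), axis=1)                   # 3x4 projection matrix
point_cam = cam_pose.transformation_matrix.astype('float32') @ UVW1  # world -> camera, no inverse
uvw = K34 @ point_cam
print(uvw / uvw[2])                                                  # back from homogeneous: (u, v, 1)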
(Images: the rendered 3D mesh, and the 3D points projected to image pixels.)
Here is a complete script that one may use or modify:
import cv2
import numpy as np
import pyrender
import trimesh
from transformations import rotation_matrix
def draw_transformed_3d_axes(
    image: np.ndarray,
    transform: np.ndarray,
    loc: np.ndarray,
    scale: float,
    projection_matrix: np.ndarray,
) -> None:
    """Draw a transformed set of coordinate axes, in color."""
    trsf_4x4 = np.eye(4)
    trsf_4x4[:3, :3] = transform
    axes_edges = np.array([[0, 1], [0, 2], [0, 3]])
    axes_verts = np.vstack([np.zeros((1, 3)), np.eye(3)]) * 3.0
    axes_verts = np.hstack([axes_verts, np.ones((len(axes_verts), 1))])
    axes_verts = np.array([0, 0, 10]) + axes_verts.dot(trsf_4x4.T)[:, :-1]
    projected = axes_verts.dot(projection_matrix.T)
    projected = projected[:, :2] / projected[:, 2:]
    center = np.array([image.shape[0] // 2, image.shape[1] // 2])
    projected = ((projected - center) * scale + loc).astype(int)
    ldmk_connection_pairs = projected[axes_edges].astype(int)
    for p_0, p_1 in ldmk_connection_pairs:
        cv2.line(image, tuple(p_0 + 1), tuple(p_1 + 1), (0, 0, 0), 2, cv2.LINE_AA)
    colors = np.fliplr(np.eye(3) * 255)
    for i, (p_0, p_1) in enumerate(ldmk_connection_pairs):
        cv2.line(image, tuple(p_0), tuple(p_1), colors[i], 2, cv2.LINE_AA)
def _render_mesh(
    vertices: np.ndarray,
    triangles: np.ndarray,
    world_to_cam: np.ndarray,
    cam_to_img: np.ndarray,
    resolution: tuple[int, int],
) -> np.ndarray:
    renderer = pyrender.OffscreenRenderer(resolution[0], resolution[1])
    camera_pr = pyrender.IntrinsicsCamera(
        cx=cam_to_img[0, 2],
        cy=cam_to_img[1, 2],
        fx=cam_to_img[0, 0],
        fy=cam_to_img[1, 1],
        zfar=5000.0,
        name="cam",
    )
    scene = pyrender.Scene(ambient_light=[100, 100, 100], bg_color=[0, 0, 0, 0])
    # OpenCV to OpenGL convention
    world_to_cam_gl = np.linalg.inv(world_to_cam).dot(rotation_matrix(np.pi, [1, 0, 0]))
    camera_node = pyrender.Node(camera=camera_pr, matrix=world_to_cam_gl)
    scene.add_node(camera_node)
    key_light = pyrender.DirectionalLight(color=np.ones(3), intensity=4.0)
    R1 = rotation_matrix(np.radians(25), [0, 1, 0])
    R2 = rotation_matrix(np.radians(-30), [1, 0, 0])
    key_pose = world_to_cam_gl.dot(R1.dot(R2))
    scene.add(key_light, pose=key_pose)
    back_light = pyrender.DirectionalLight(color=np.ones(3), intensity=1.0)
    R1 = rotation_matrix(np.radians(-150), [0, 1, 0])
    back_pose = world_to_cam_gl.dot(R1)
    scene.add(back_light, pose=back_pose)
    mesh_trimesh = trimesh.Trimesh(vertices, triangles, process=False)
    colors = np.repeat([[255, 61, 13]], len(vertices), axis=0)
    mesh_trimesh.visual.vertex_colors = colors
    mesh_pyrender = pyrender.Mesh.from_trimesh(mesh_trimesh, smooth=True)
    mesh_pyrender.primitives[0].material.roughnessFactor = 0.6
    mesh_pyrender.primitives[0].material.alphaMode = "OPAQUE"
    scene.add(mesh_pyrender)
    rendered_img, _ = renderer.render(scene, flags=pyrender.RenderFlags.RGBA | pyrender.RenderFlags.ALL_SOLID)
    renderer.delete()
    return rendered_img.astype(float) / 255
def project_3D_to_2D_pixel(
    vertices: np.ndarray,
    triangles: np.ndarray,
    world_to_cam: np.ndarray,
    cam_to_img: np.ndarray,
    resolution: tuple[int, int],
) -> np.ndarray:
    cam_to_img = np.concatenate((cam_to_img, [[0], [0], [0]]), axis=1)
    world_to_image = cam_to_img @ world_to_cam
    W2P = (world_to_image[:, :3] @ vertices.T + world_to_image[:, 3:4]).T
    pixels = np.zeros_like(W2P[:, 0:2])
    for idx, s in enumerate(W2P[:, 2]):
        if abs(s) > 1e-6:
            pixels[idx, 0] = W2P[idx, 0] / s
            pixels[idx, 1] = W2P[idx, 1] / s
    # assuming a 512x512 image
    img = np.zeros((512, 512)).astype('uint8')
    for p in pixels:
        cv2.circle(img, (int(p[0]), int(p[1])), 1, 255, -1)
    return img
def draw_mesh_from_VF(
    vertices: np.ndarray,
    triangles: np.ndarray,
    image: np.ndarray,
    world_to_cam: np.ndarray,
    cam_to_img: np.ndarray,
) -> np.ndarray:
    render = _render_mesh(vertices, triangles, world_to_cam, cam_to_img, image.shape[:2][::-1])
    # alpha blend the render over the input image
    return (
        ((image.astype(np.float64) / 255) * (1 - 0.75 * render[..., -1:]) + render[..., :3] * 0.75 * render[..., -1:])
        * 255
    ).astype(np.uint8)
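For completeness, a hypothetical usage sketch: the mesh path, the pose values, and the 512x512 resolution below are placeholder assumptions, not part of the script above.
# Hypothetical usage; mesh path, pose and resolution are assumptions.
mesh = trimesh.load("mesh.obj", process=False)  # assuming the file loads as a single Trimesh
cam_to_img = np.array([[415.69219382, 0., 320.],
                       [0., 415.69219382, 240.],
                       [0., 0., 1.]])
world_to_cam = np.eye(4)
world_to_cam[:3, 3] = [0., 0., 10.]             # place the world origin 10 units in front of the camera
img = project_3D_to_2D_pixel(mesh.vertices, mesh.faces, world_to_cam, cam_to_img, (512, 512))
overlay = draw_mesh_from_VF(mesh.vertices, mesh.faces,
                            np.zeros((512, 512, 3), np.uint8), world_to_cam, cam_to_img)
cv2.imwrite("projected.png", img)
cv2.imwrite("overlay.png", overlay)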
Upvotes: 0