Eric

Reputation: 656

Passing a pandas dataframe to FastAPI for NLP ML

I am trying, for the first time, to deploy an NLP ML model. To do this it was suggested that I use FastAPI and uvicorn. I have had some success in getting FastAPI to respond; however, I have not been able to successfully pass the dataframe and have it processed. I've tried using dictionaries and even attempted to convert the passed JSON to a dataframe.

With data_dict = data.dict() I get: ValueError: Iterable over raw text documents expected, string object received.

With data_dict = pd.DataFrame(data.dict()) I get: ValueError: If using all scalar values, you must pass an index

I believe I understand the problem: my Data class expects a string, which this is not; however, I have not been able to determine how to set and/or pass the expected data so that fit_transform() will work. Ultimately I will have a prediction returned based on the submitted messages value. Bonus if I can pass a dataframe of one or more rows and have predictions made and returned for each of the rows. The response will include the id, project, and the prediction, so that in the future we can leverage this response to post the prediction back to the original (requesting) system.

test_connection.py

#%%
import requests
import pandas as pd
import json
import os
from pprint import pprint

url = 'http://127.0.0.1:8000/predict'
print(os.getcwd())
#%%
df = pd.DataFrame(
    {
        'id': ['ab410483801c38', 'cd34148639180'],
        'project': ['project1', 'project2'], 
        'messages': ['This is message 1', 'This is message 2']
    }
)
to_predict_dict = df.iloc[0].to_dict()  # send a single row as a plain dict
#%%
r = requests.post(url, json=to_predict_dict)

main.py

#!/usr/bin/env python
# coding: utf-8

import pickle
import pandas as pd
import numpy as np
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer

# Server
import uvicorn
from fastapi import FastAPI
# Model
import xgboost as xgb


app = FastAPI()

clf = pickle.load(open('data/xgbmodel.pickle', 'rb'))

class Data(BaseModel):
    # id: str
    project: str
    messages: str

@app.get("/ping")
async def test():
    return {"ping": "pong"}

@app.post("/predict")
async def predict(data: Data):
#    data_dict = data.dict()  # fit_transform() then raises: ValueError: Iterable over raw text documents expected, string object received.
    data_dict = pd.DataFrame(data.dict())  # raises: ValueError: If using all scalar values, you must pass an index
    tfidf_vect = TfidfVectorizer(stop_words="english", analyzer='word', token_pattern=r'\w{1,}')
    tfidf_vect.fit_transform(data_dict['messages'])
#   to_predict = tfidf_vect.transform(data_dict['messages'])
#   prediction = clf.predict(to_predict)

    return {"response": "Success"}

Upvotes: 2

Views: 10344

Answers (4)

poldpold

Reputation: 53

A new library called pandera now supports passing DataFrames directly through FastAPI without manual conversion. The docs are a bit basic as of this posting, but may be worth reading: https://pandera.readthedocs.io/en/latest/fastapi.html#fastapi-integration.
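For illustration, a minimal sketch of what that integration might look like for this question (the schema name, fields, and endpoint are assumptions based on the question, not copied from the pandera docs):

import pandera as pa
from fastapi import FastAPI
from pandera.typing import DataFrame, Series

class MessageSchema(pa.DataFrameModel):
    # assumed columns, mirroring the question's dataframe
    id: Series[str]
    project: Series[str]
    messages: Series[str]

    class Config:
        coerce = True

app = FastAPI()

@app.post("/predict")
def predict(data: DataFrame[MessageSchema]):
    # data arrives as a validated pandas DataFrame, so data['messages']
    # can be handed to a fitted vectorizer/model as an iterable of texts.
    return {"rows": len(data)}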

Upvotes: 1

J. Javier Gálvez

Reputation: 99

First, encode your DataFrame df as record-oriented JSON:

r = requests.post(url, json=df.to_json(orient='records'))

Then, decode your data inside the /predict/ endpoint with:

df = pd.DataFrame(jsonable_encoder(data))

Remember the import: from fastapi.encoders import jsonable_encoder.
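A minimal end-to-end sketch of this idea (hedged: the endpoint is assumed to accept a list of Data records, and the client sends df.to_dict(orient='records') so that requests serializes the payload once, a small change from the to_json call above):

from typing import List

import pandas as pd
from fastapi import FastAPI
from fastapi.encoders import jsonable_encoder
from pydantic import BaseModel

app = FastAPI()

class Data(BaseModel):
    id: str
    project: str
    messages: str

@app.post("/predict")
def predict(data: List[Data]):
    # jsonable_encoder turns the list of pydantic models into a list of
    # dicts, which pandas loads directly as rows.
    df = pd.DataFrame(jsonable_encoder(data))
    return {"rows": len(df), "ids": df['id'].tolist()}

# Client side (hypothetical):
# r = requests.post(url, json=df.to_dict(orient='records'))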

Upvotes: 1

Eric

Reputation: 656

I was able to address the issue by simply converting data.messages into a list. I also had to make an unrelated change: I had failed to pickle my vectorizer (string tokenizer).

import pickle
import pandas as pd
import numpy as np
import json
import time
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer

# Server / endpoint
import uvicorn
from fastapi import FastAPI
# Model
import xgboost as xgb


app = FastAPI(debug=True)

clf = pickle.load(open('data/xgbmodel.pickle', 'rb'))
vect = pickle.load(open('data/tfidfvect.pickle', 'rb'))

class Data(BaseModel):
    id: str = None
    project: str
    messages: str

@app.get("/ping")
async def ping():
    return {"ping": "pong"}

@app.post("/predict/")
def predict(data: Data):
    start = time.time()
    data_l = [data.messages] # make messages iterable.
    to_predict = vect.transform(data_l)
    prediction = clf.predict(to_predict)

    exec_time = round((time.time() - start), 3)
    return {
        "id": data.id,
        "project": data.project,
        "prediction": prediction[0], 
        "execution_time": exec_time
        }

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
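For reference, a rough sketch of the training-time step that produces the two pickles loaded above (assuming an XGBClassifier; the training texts and labels are placeholders, only the file names match the code above):

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb

# Placeholders for the real training data.
train_texts = ["This is message 1", "This is message 2"]
train_labels = [0, 1]

vect = TfidfVectorizer(stop_words="english", analyzer='word', token_pattern=r'\w{1,}')
X = vect.fit_transform(train_texts)

clf = xgb.XGBClassifier()
clf.fit(X, train_labels)

# Persist both the fitted vectorizer and the model so the API transforms
# incoming text exactly as it was transformed at training time.
pickle.dump(vect, open('data/tfidfvect.pickle', 'wb'))
pickle.dump(clf, open('data/xgbmodel.pickle', 'wb'))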

Upvotes: 0

Eric

Reputation: 656

Probably not the most elegant solution, but I've made progress using the following:

def predict(data: Data):
    data_dict = pd.DataFrame(
        {
            # wrapping each scalar in a one-element list gives the
            # DataFrame an index, avoiding the "all scalar values" error
            'id': [data.id],
            'project': [data.project],
            'messages': [data.messages]
        }
    )
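
For context, a hedged continuation of that fragment: data_dict['messages'] is now a one-element Series rather than a bare string, so the question's original fit_transform call no longer raises.

    tfidf_vect = TfidfVectorizer(stop_words="english", analyzer='word', token_pattern=r'\w{1,}')
    to_predict = tfidf_vect.fit_transform(data_dict['messages'])
    # prediction = clf.predict(to_predict)  # as in the question, once the vectorizer matches the trained model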

Upvotes: 1
