Reputation: 656
This is my first time deploying an NLP ML model, and it was suggested that I use FastAPI and uvicorn. I have had some success getting FastAPI to respond; however, I have not been able to pass the dataframe to the endpoint and have the model process it. I've tried using dictionaries and even attempted to convert the passed JSON to a dataframe.
With data_dict = data.dict()
I get:
ValueError: Iterable over raw text documents expected, string object received.
With data_dict = pd.DataFrame(data.dict())
I get:
ValueError: If using all scalar values, you must pass an index
I believe I understand the problem: my Data class expects a single string per field, so what reaches the vectorizer is a scalar rather than an iterable of documents. However, I have not been able to determine how to declare and/or pass the data so that fit_transform() will work. Ultimately I want a prediction returned based on the submitted messages value. Bonus if I can pass a dataframe of one or more rows and have a prediction made and returned for each row. The response should include the id, project, and the prediction, so that we can later use it to post the prediction back to the original (requesting) system.
test_connection.py
#%%
import requests
import pandas as pd
import json
import os
from pprint import pprint
url = 'http://127.0.0.1:8000/predict'
print(os.getcwd())
#%%
df = pd.DataFrame(
    {
        'id': ['ab410483801c38', 'cd34148639180'],
        'project': ['project1', 'project2'],
        'messages': ['This is message 1', 'This is message 2']
    }
)
to_predict_dict = df.iloc[0].to_dict()
#%%
r = requests.post(url, json=to_predict_dict)
main.py
#!/usr/bin/env python
# coding: utf-8
import pickle
import pandas as pd
import numpy as np
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer
# Server
import uvicorn
from fastapi import FastAPI
# Model
import xgboost as xgb
app = FastAPI()
clf = pickle.load(open('data/xgbmodel.pickle', 'rb'))
class Data(BaseModel):
    # id: str
    project: str
    messages: str

@app.get("/ping")
async def test():
    return {"ping": "pong"}

@app.post("/predict")
async def predict(data: Data):
    # data_dict = data.dict()
    data_dict = pd.DataFrame(data.dict())
    tfidf_vect = TfidfVectorizer(stop_words="english", analyzer='word', token_pattern=r'\w{1,}')
    tfidf_vect.fit_transform(data_dict['messages'])
    # to_predict = tfidf_vect.transform(data_dict['messages'])
    # prediction = clf.predict(to_predict)
    return {"response": "Success"}
Upvotes: 2
Views: 10344
Reputation: 53
A new library called pandera now supports passing DataFrames directly through FastAPI without conversion. The docs are a bit basic as of posting this, but may be worth reading: https://pandera.readthedocs.io/en/latest/fastapi.html#fastapi-integration.
Upvotes: 1
Reputation: 99
First, encode your DataFrame df as record-oriented JSON. Pass the list of records rather than the string returned by df.to_json(), because requests serializes the json= argument itself:
r = requests.post(url, json=df.to_dict(orient='records'))
Then, decode the data inside the /predict/ endpoint with:
df = pd.DataFrame(jsonable_encoder(data))
Remember to import the encoder:
from fastapi.encoders import jsonable_encoder
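To make the record orientation concrete, here is a minimal round trip that needs no running server. It is only a sketch: `json.dumps` and `json.loads` stand in for the encoding that requests performs for `json=` and the decoding FastAPI performs on the request body, and the variable names are illustrative. On the server side it also assumes the endpoint is declared to accept a list of records, which the answer above does not show.

```python
import json

import pandas as pd

# Same sample frame as in the question.
df = pd.DataFrame(
    {
        "id": ["ab410483801c38", "cd34148639180"],
        "project": ["project1", "project2"],
        "messages": ["This is message 1", "This is message 2"],
    }
)

# Client side: one dict per row ("records" orientation).
records = df.to_dict(orient="records")

# Stand-in for the wire format produced when posting with json=records.
payload = json.dumps(records)

# Server side: the decoded body is a list of dicts, which pandas
# turns back into a frame with one row per record.
received = json.loads(payload)
restored = pd.DataFrame(received)

print(restored)
```

Because each record already carries every column, the rebuilt frame has the same columns and row order as the original, so the vectorizer can be applied to `restored["messages"]` as an iterable of strings.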
Upvotes: 1
Reputation: 656
I was able to address the issue by simply converting data.messages into a list. I also had to make an unrelated change: I had failed to pickle my vectorizer (string tokenizer), so the server had no fitted vectorizer to load.
import pickle
import pandas as pd
import numpy as np
import json
import time
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer
# Server / endpoint
import uvicorn
from fastapi import FastAPI
# Model
import xgboost as xgb
app = FastAPI(debug=True)
clf = pickle.load(open('data/xgbmodel.pickle', 'rb'))
vect = pickle.load(open('data/tfidfvect.pickle', 'rb'))
class Data(BaseModel):
    id: str = None
    project: str
    messages: str

@app.get("/ping")
async def ping():
    return {"ping": "pong"}

@app.post("/predict/")
def predict(data: Data):
    start = time.time()
    data_l = [data.messages]  # make messages iterable.
    to_predict = vect.transform(data_l)
    prediction = clf.predict(to_predict)
    exec_time = round((time.time() - start), 3)
    return {
        "id": data.id,
        "project": data.project,
        "prediction": prediction[0],
        "execution_time": exec_time
    }

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
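The two changes described here (fit and pickle the vectorizer at training time, then only call transform() on a list at serving time) can be sketched end to end without a server. Everything below is an assumption for illustration: the toy training messages and labels are made up, `LogisticRegression` is a stand-in for the XGBoost model, the pickles are kept in memory rather than written to data/*.pickle, and `predict_rows` is a hypothetical helper that also covers the "one or more rows" case from the question.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# --- Training time: fit the vectorizer and the model on the same corpus. ---
train_messages = [
    "server crashed again",
    "disk failure on node",
    "great release thanks",
    "love the new feature",
]
train_labels = [1, 1, 0, 0]  # hypothetical labels for the toy corpus

vect = TfidfVectorizer(stop_words="english", analyzer="word", token_pattern=r"\w{1,}")
X_train = vect.fit_transform(train_messages)
clf = LogisticRegression().fit(X_train, train_labels)  # stand-in for the XGBoost model

# Persist BOTH objects; the fitted vocabulary lives in the vectorizer.
vect_bytes = pickle.dumps(vect)
clf_bytes = pickle.dumps(clf)

# --- Serving time: load once, then only transform (never fit again). ---
served_vect = pickle.loads(vect_bytes)
served_clf = pickle.loads(clf_bytes)


def predict_rows(rows):
    """Predict for a list of dicts with 'id', 'project' and 'messages' keys."""
    messages = [row["messages"] for row in rows]  # iterable of strings
    features = served_vect.transform(messages)
    predictions = served_clf.predict(features)
    return [
        # int() converts the NumPy scalar into a plain, JSON-friendly int.
        {"id": row["id"], "project": row["project"], "prediction": int(pred)}
        for row, pred in zip(rows, predictions)
    ]


rows = [
    {"id": "ab410483801c38", "project": "project1", "messages": "This is message 1"},
    {"id": "cd34148639180", "project": "project2", "messages": "This is message 2"},
]
results = predict_rows(rows)
print(results)
```

Calling fit_transform() inside the endpoint, as in the question, would build a new vocabulary from the single incoming message, so the feature columns would no longer line up with what the classifier was trained on. Reusing the pickled vectorizer keeps the feature width fixed.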
Upvotes: 0
Reputation: 656
Probably not the most elegant solution but I've made progress using the following:
def predict(data: Data):
    data_dict = pd.DataFrame(
        {
            'id': [data.id],
            'project': [data.project],
            'messages': [data.messages]
        }
    )
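A shorter variant of the same idea, sketched here with a plain dict standing in for the result of data.dict(): wrapping the whole record in a list gives pandas one row to build, which avoids both the scalar-values error from the question and the need to wrap every field by hand.

```python
import pandas as pd

# Stand-in for data.dict() on a single request body.
record = {
    "id": "ab410483801c38",
    "project": "project1",
    "messages": "This is message 1",
}

# Passing the dict directly fails, because every value is a scalar
# and pandas cannot infer how many rows to create.
try:
    pd.DataFrame(record)
    scalar_error = None
except ValueError as exc:
    scalar_error = str(exc)

# Wrapping the record in a list yields a one-row frame.
frame = pd.DataFrame([record])

print(scalar_error)
print(frame)
```

The same call works unchanged for several records, since a list of dicts maps to one row per dict.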
Upvotes: 1