By: Chansung Park and Sayak Paul
This project shows how to serve a TensorFlow image classification model as RESTful and gRPC based services with TFServing, Docker, and Kubernetes. The idea is to first create a custom TFServing Docker image with a TensorFlow model, and then deploy it on a k8s cluster running on Google Kubernetes Engine (GKE). We are particularly interested in deploying the model as a gRPC endpoint with TF Serving on a k8s cluster using GKE, and in using GitHub Actions to automate all the procedures whenever a new TensorFlow model is released.
👋 NOTE
Update July 29, 2022: We published a blog post on load-testing the REST endpoint. Check it out on the TensorFlow blog here.
```mermaid
flowchart LR
    A[First: Environmental Setup]-->B;
    B[Second: Build TFServing Image]-->C[Third: Deploy on GKE];
```
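As a reference for the "Build TFServing Image" step, the model baked into the image needs to be exported in TensorFlow's SavedModel layout with a numeric version directory. The snippet below is only an illustrative sketch (the model name `resnet` and version `1` are assumptions), not the repository's actual export code.

```python
# Export a Keras ResNet50 to the directory layout TFServing expects:
#   models/<model_name>/<version>/saved_model.pb, variables/, ...
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")
model.save("models/resnet/1")  # TF 2.x saves in SavedModel format by default
```

The resulting `models/` directory can then be copied into a TFServing base image when building the custom Docker image.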
The `deployment.yml` workflow file is only triggered when there is a new release for the current repository. It is subdivided into three parts to do the following tasks:

- Set up the environment, including the `gcloud` CLI toolkit.
- Build the custom TFServing image and push it to Google Container Registry (the image used here is `gcr.io/gcp-ml-172005/tfs-resnet-cpu-opt`, and it is publicly available).
- Deploy on GKE: via `Deployment`, `Service`, and `ConfigMap`, the custom TFServing image gets deployed. The `ConfigMap` is only used for batching-enabled scenarios to inject batching configurations dynamically into the `Deployment`.

If the entire workflow goes without any errors, you will see something similar to the text below. As you can see, two external ports (8500 for gRPC, 8501 for RESTful) are exposed. You can check out the complete logs in the past runs.
```
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                         AGE
tfs-server   LoadBalancer   xxxxxxxxxx   xxxxxxxxxx    8500:30869/TCP,8501:31469/TCP   23m
kubernetes   ClusterIP      xxxxxxxxxx   <none>        443/TCP                         160m
```
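Once the `EXTERNAL-IP` is assigned, a quick way to sanity-check the deployment is to query TFServing's model status REST API on port 8501. This is just a minimal sketch; the model name `resnet` and the IP placeholder are assumptions.

```python
# Query TFServing's model status endpoint (GET /v1/models/<model_name>).
# Replace the IP with the LoadBalancer EXTERNAL-IP from `kubectl get svc`,
# and "resnet" with the actual model name baked into the image.
import requests

resp = requests.get("http://xxxxxxxxxx:8501/v1/models/resnet", timeout=10)
print(resp.json())  # reports the served model version and its state, e.g. AVAILABLE
```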
If you wonder how to perform gRPC inference, `grpc_client.py` provides code to perform inference with the gRPC client. `grpc_client.py` contains an `$ENDPOINT` placeholder; to replace it with your own endpoint, define the `ENDPOINT` environment variable and run `envsubst < grpc_client.py > grpc_client_tmp.py && mv grpc_client_tmp.py grpc_client.py` (writing the output directly back onto the same file would truncate it before `envsubst` reads it). The TFServing API provides handy features to construct the protobuf request message via `predict_pb2.PredictRequest()`, and `tf.make_tensor_proto(image)` creates protobuf-compatible values from the `Tensor` data type.
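For illustration, here is a minimal sketch of what such a gRPC client can look like. The model name `resnet`, signature `serving_default`, input tensor name `input_1`, and input shape are assumptions for this sketch, not necessarily what the repository's `grpc_client.py` uses.

```python
# Minimal TFServing gRPC client sketch (names and shapes are illustrative).
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

ENDPOINT = "xxxxxxxxxx:8500"  # replace with the LoadBalancer's EXTERNAL-IP and gRPC port

channel = grpc.insecure_channel(ENDPOINT)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Dummy batch of one 224x224 RGB image; replace with a real preprocessed input.
image = np.random.rand(1, 224, 224, 3).astype(np.float32)

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet"                  # assumed model name
request.model_spec.signature_name = "serving_default"
request.inputs["input_1"].CopyFrom(tf.make_tensor_proto(image))  # assumed input tensor name

response = stub.Predict(request, timeout=10.0)
print(response.outputs)
```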
We used Locust to conduct load tests for both TFServing and FastAPI. Below are the results for TFServing (gRPC) on various setups; you can find the results for FastAPI (RESTful) in a separate repo. For specific instructions on how to install Locust and run a load test, follow this separate document.
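To give a rough idea of what a Locust test can look like, below is a minimal sketch targeting TFServing's REST API on port 8501. The model name `resnet` and the input shape are placeholders; the repository's actual load-test setup lives in the separate document mentioned above.

```python
# Minimal Locust sketch for TFServing's REST predict API.
# Run with: locust -f locustfile.py --host http://<EXTERNAL-IP>:8501
import numpy as np
from locust import HttpUser, task, between


class TFServingRestUser(HttpUser):
    wait_time = between(0.5, 1.5)

    def on_start(self):
        # One dummy preprocessed image, reused for every request.
        image = np.random.rand(1, 224, 224, 3).tolist()
        self.payload = {"instances": image}

    @task
    def predict(self):
        # TFServing's REST predict API: POST /v1/models/<model_name>:predict
        self.client.post("/v1/models/resnet:predict", json=self.payload)
```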
Different values of the `--tensorflow_inter_op_parallelism`, `--tensorflow_intra_op_parallelism`, and `--enable_batching` options give different results. From the results above, the best value of `--tensorflow_inter_op_parallelism` seems to be 4. The value of `--tensorflow_intra_op_parallelism` is fixed to the number of CPU cores, since it specifies the number of threads used to parallelize the execution of an individual op. `--enable_batching` could give you better performance; however, since TFServing doesn't respond to each request immediately (it waits to form a batch), there is a latency trade-off. Unless you care about dynamic batching capabilities, we recommend the `2n-8c-16r-interop4` configuration (2 nodes of 8vCPU + 16GB RAM, running 2 replicas of TFServing with `--tensorflow_inter_op_parallelism=4`). You can also build a similar setup for smaller machines by referencing `2n-8c-16r-interop2-batch`.
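For intuition about what the two parallelism knobs control, the analogous TensorFlow runtime settings are shown below. This is not how the repository configures TFServing (which uses the CLI flags above); it is only an illustrative sketch with made-up values.

```python
import tensorflow as tf

# Threads used to run independent ops concurrently (inter-op parallelism).
tf.config.threading.set_inter_op_parallelism_threads(4)

# Threads used to parallelize work inside a single op, e.g. a large matmul
# (intra-op parallelism); typically set to the number of CPU cores.
tf.config.threading.set_intra_op_parallelism_threads(8)
```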
👋 NOTE

The result plots are drawn with `matplotlib` after collecting the CSV files generated from Locust. In the configuration names, `n` means the number of nodes (pods), `c` means the number of CPU cores, `r` means the RAM capacity, `interop` means the value of `--tensorflow_inter_op_parallelism`, and `batch` means that the batching configuration is enabled for that config.

Future works:

- More load test comparisons with more ML inference frameworks such as NVIDIA's Triton Inference Server, KServe, and RedisAI.
- Advancing this repo by providing a semi-automatic model deployment. To be more specific, when new code implementing a new ML model is submitted as a pull request, maintainers could trigger a model performance evaluation on GCP's Vertex Training via comments. The experiment results could be exposed through TensorBoard.dev or W&B. If the results are approved, the code will be merged, the trained model will be released, and it will be deployed on GKE.