In this blog post, you will learn how easy it is to get the run ID in your current Azure Databricks notebook. This is especially helpful if you want to use that run ID for deployment within the same notebook that you’re working on.
FTF (First Things First)
Install MLflow
What is MLflow?
“MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: MLflow Tracking, MLflow Projects, MLflow Models & Model Registry” - Read more at MLflow.org
In this blog, we will only cover basic usage of MLflow: logging a model, saving it, and tracking the run.
Setup
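You can install MLflow as a workspace library or with a Python command. In this sample, we’re going to install it from a Python notebook. (On newer Databricks Runtime versions, where the dbutils.library install utilities are no longer available, %pip install mlflow does the same job.)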
dbutils.library.installPyPI("mlflow")
dbutils.library.restartPython()
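Then we import it, along with the other modules the training code needs.
# Import mlflow
import mlflow
import mlflow.sklearn
# import for unique folder naming
import uuid
# imports used by the training function below
import json
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB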
Basic Usage
# create a unique folder name
unique_folder = str(uuid.uuid1())
def train_and_log_model(data):
    # Split dataset into training set and test set
    x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=109)
    # Start an MLflow run; the "with" keyword ensures we'll close the run even if this cell crashes
    with mlflow.start_run():
        model = GaussianNB()
        # train the model
        model.fit(x_train, y_train)
        # Predict
        preds = model.predict(x_test)
        # log the model to the run, then save a copy under a unique DBFS folder
        mlflow.sklearn.log_model(model, "model")
        modelpath = "/dbfs/mlflow/" + unique_folder + "/model"
        mlflow.sklearn.save_model(model, modelpath)
        # fetch the active run (this must happen while the run is still active)
        client = mlflow.tracking.MlflowClient()
        active_run = client.get_run(mlflow.active_run().info.run_id).data
        # the log-model history tag is a JSON list; its first entry carries the run ID
        log_model_history_json = json.loads(active_run.tags['mlflow.log-model.history'])
        log_model_history_data = log_model_history_json[0]
        return log_model_history_data['run_id']
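The last lines of the function read the run ID out of the mlflow.log-model.history tag, which holds a JSON string. Here is a minimal sketch of that parsing step on its own. The tag value below is made up for illustration: only the run_id key is taken from the code above, the other keys are assumptions, and a real tag usually carries more fields. The helper name run_id_from_history is hypothetical too.

```python
import json

# Illustrative tag value only (an assumption, not real MLflow output).
sample_tags = {
    "mlflow.log-model.history": json.dumps([
        {
            "run_id": "1ccc2a132124468e91e4ea636b6f4023",
            "artifact_path": "model",
        }
    ])
}


def run_id_from_history(tags):
    """Return the run ID stored in the first log-model history entry."""
    history = json.loads(tags["mlflow.log-model.history"])  # JSON string -> list of dicts
    return history[0]["run_id"]


print(run_id_from_history(sample_tags))
# → 1ccc2a132124468e91e4ea636b6f4023
```

Note that the tag is only there after a model has been logged in the run, so the lookup would raise a KeyError on a run without a logged model.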
Use it
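# Assumption: any scikit-learn dataset with .data and .target works here; the wine dataset is only an example
from sklearn.datasets import load_wine
data = load_wine()
run_id = train_and_log_model(data) # call to run
print(run_id)
# example result: 1ccc2a132124468e91e4ea636b6f4023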
You can view the Run tab to see the logged model.
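Once you have the run ID, a common next step toward deployment is to turn it into a model URI. Here is a small sketch, assuming the model was logged under the artifact path "model" as in the function above. The helper name model_uri_for_run is hypothetical and not part of MLflow.

```python
def model_uri_for_run(run_id: str, artifact_path: str = "model") -> str:
    """Build a runs:/ URI pointing at a model logged in the given run."""
    return f"runs:/{run_id}/{artifact_path}"


model_uri = model_uri_for_run("1ccc2a132124468e91e4ea636b6f4023")
print(model_uri)
# → runs:/1ccc2a132124468e91e4ea636b6f4023/model

# In a notebook with MLflow installed, this URI can be passed to
# mlflow.sklearn.load_model(model_uri) to load the logged model back.
```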
In the above code, after we save our model, we use the mlflow.tracking.MlflowClient() class. What is mlflow.tracking? It is a module that provides a Python CRUD interface to MLflow experiments and runs. The MlflowClient() class has many methods, but the one we need is get_run(id), because it fetches the run from the backend store from within the current notebook, and the resulting run contains a collection of run metadata as well as collections of run parameters, tags, and metrics.
I would also advise adding more MLflow tracking information, such as Parameters, Metrics, Tags, and even notes, when saving your model.
Learn more from the MLflow documentation.
If you have any questions or comments, please drop them below 👇 :)