Deploying and saving a model
What is DeploySupervisedModel
?
- This class lets one save a model (for recurrent use) and push predictions to a database
- One can do both classification (ie, predict Y/N) as well as regression (ie, predict a numeric field).
Am I ready for model deployment?
Only if you've already completed these steps:
- You've found a model work that works well on your data
- You've created a column called InTestWindowFLG (or something similar), where 'Y' denotes rows that need a prediction and 'N' for rows that train the model.
- You've created the SQL table structure to receive predictions
For classification predictions:
CREATE TABLE [SAM].[dbo].[HCPyDeployClassificationBASE] (
[BindingID] [int] ,
[BindingNM] [varchar] (255),
[LastLoadDTS] [datetime2] (7),
[PatientEncounterID] [decimal] (38, 0), --< change to your grain col
[PredictedProbNBR] [decimal] (38, 2),
[Factor1TXT] [varchar] (255),
[Factor2TXT] [varchar] (255),
[Factor3TXT] [varchar] (255))
For regression predictions:
CREATE TABLE [SAM].[dbo].[HCPyDeployRegressionBASE] (
[BindingID] [int],
[BindingNM] [varchar] (255),
[LastLoadDTS] [datetime2] (7),
[PatientEncounterID] [decimal] (38, 0), --< change to your grain col
[PredictedValueNBR] [decimal] (38, 2),
[Factor1TXT] [varchar] (255),
[Factor2TXT] [varchar] (255),
[Factor3TXT] [varchar] (255))
Step 1: Pull in the data
For SQL:
import pyodbc
cnxn = pyodbc.connect("""SERVER=localhost;
DRIVER={SQL Server Native Client 11.0};
Trusted_Connection=yes;
autocommit=True""")
df = pd.read_sql(
sql="""SELECT
*
FROM [SAM].[dbo].[HCPyDiabetesClinical]""",
con=cnxn)
# Handle missing data (if needed)
df.replace(['None'],[None],inplace=True)
For CSV:
df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
na_values=['None'])
Step 2: Set your data-prep parameters
The DeploySupervisedModel
cleans and prepares the data prior to model
creation.
- Return: an object.
- Arguments: : - modeltype: a string. This will either be 'classification' or 'regression'. - df: a data frame. The data your model will be based on. - predictedcol: a string. Name of variable (or column) that you want to predict. - graincol: a string, defaults to None. Name of possible GrainID column in your dataset. If specified, this column will be removed, as it won't help the algorithm. - impute: a boolean. Whether to impute by replacing NULLs with column mean (for numeric columns) or column mode (for categorical columns). - debug: a boolean, defaults to False. If TRUE, console output when comparing models is verbose for easier debugging. - windowcol: a string. Which column in the dataset denotes which rows are test ('Y') or training ('N').
Example code:
p = DeploySupervisedModel(modeltype='regression',
df=df,
graincol='PatientEncounterID',
windowcol='InTestWindowFLG',
predictedcol='LDLNBR',
impute=True,
debug=False)
Step 3: Create and save the model
The deploy
creates the model and method makes predictions that are
pushed to a database.
- Return: an object.
- Arguments: : - method: a string. If you choose random forest, use 'rf'. If you choose to deploy the linear model, use 'linear'. - cores: an integer. Denotes how many of your processors to use. - server: a string. Which server are you pushing predictions to? - dest_db_schema_table: a string. Which database.schema.table are you pushing predictions to? - trees: an integer, defaults to 200. Use only if working with random forest. This denotes number of trees in the forest. - debug: a boolean, defaults to False. If TRUE, console output when comparing models is verbose for easier debugging.
Example code:
p.deploy(method='rf',
cores=2,
server='localhost',
dest_db_schema_table='[SAM].[dbo].[HCPyDeployRegressionBASE]',
use_saved_model=False,
trees=200,
debug=False)
Full example code
```python from healthcareai import DeploySupervisedModel import pandas as pd import time
def main():
t0 = time.time()
# Load in data
# CSV snippet for reading data into dataframe
df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
na_values=['None'])
# SQL snippet for reading data into dataframe
# import pyodbc
# cnxn = pyodbc.connect("""SERVER=localhost;
# DRIVER={SQL Server Native Client 11.0};
# Trusted_Connection=yes;
# autocommit=True""")
#
# df = pd.read_sql(
# sql="""SELECT *
# FROM [SAM].[dbo].[HCPyDiabetesClinical]""",
# con=cnxn)
#
# # Set None string to be None type
# df.replace(['None'],[None],inplace=True)
# Look at data that's been pulled in
print(df.head())
print(df.dtypes)
# Drop columns that won't help machine learning
df.drop('PatientID', axis=1, inplace=True)
p = DeploySupervisedModel(modeltype='regression',
df=df,
graincol='PatientEncounterID',
windowcol='InTestWindowFLG',
predictedcol='LDLNBR',
impute=True,
debug=False)
p.deploy(method='rf',
cores=2,
server='localhost',
dest_db_schema_table='[SAM].[dbo].[HCPyDeployRegressionBASE]',
use_saved_model=False,
trees=200,
debug=False)
print('\nTime:\n', time.time() - t0)
if __name__ == "__main__":
main()
``````