Developing and comparing models

What is DevelopSupervisedModel?

  • This class let's one create and compare custom models on diverse datasets.
  • One can do both classification (ie, predict Y/N) as well as regression (ie, predict a numeric field).
  • To jump straight to an example notebook, see here

Am I ready for model creation?

Maybe. It'll help if you follow these guidelines:

  • Don't use 0 or 1 for the independent variable when doing classification. Use Y/N instead. The IIF function in T-SQL may help here.
  • Don't pull in test data in this step. In other words, we just pull in those rows where the target (ie, predicted column has a value already).

Of course, feature engineering is always a good idea.

Step 1: Pull in the data

For SQL:

import pyodbc
cnxn = pyodbc.connect("""SERVER=localhost;
                        DRIVER={SQL Server Native Client 11.0};
                        Trusted_Connection=yes;
                        autocommit=True""")

 df = pd.read_sql(
     sql="""SELECT *
            FROM [SAM].[dbo].[HCPyDiabetesClinical]""",
     con=cnxn)


 # Handle missing data (if needed)
 df.replace(['None'],[None],inplace=True)

For CSV:

df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                 na_values=['None'])

Step 2: Set your data-prep parameters

The DevelopSupervisedModel class cleans and prepares the data before model creation

  • Return: an object.
  • Arguments: : - modeltype: a string. This will either be 'classification' or 'regression'. - df: a data frame. The data your model will be based on. - predictedcol: a string. Name of variable (or column) that you want to predict. - graincol: a string, defaults to None. Name of possible GrainID column in your dataset. If specified, this column will be removed, as it won't help the algorithm. - impute: a boolean. Whether to impute by replacing NULLs with column mean (for numeric columns) or column mode (for categorical columns). - debug: a boolean, defaults to False. If TRUE, console output when comparing models is verbose for easier debugging.

Example code:

o = DevelopSupervisedModel(modeltype='classification',
                           df=df,
                           predictedcol='ThirtyDayReadmitFLG',
                           graincol='PatientEncounterID', #OPTIONAL
                           impute=True,
                           debug=False)

Step 3: Create and compare models

Example code:

# Run the linear model
o.linear(cores=1)

# Run the random forest model
o.random_forest(cores=1)

Go further using utility methods

The plot_rffeature_importance method plots the input columns in order of importance to the model.

  • Return: a plot.
  • Arguments: : - save: a boolean, defaults to False. If True, the plot is saved to the location displayed in the console.

Example code:

# Look at the feature importance rankings
o.plot_rffeature_importance(save=False)

The plot_roc method plots the AU_ROC chart, for easier model comparison.

  • Return: a plot.
  • Arguments: : - save: a boolean, defaults to False. If True, the plot is saved to the location displayed in the console. - debug: a boolean. If True, console output is verbose for easier debugging.

Example code:

# Create ROC plot to compare the two models
o.plot_roc(debug=False,
           save=False)

Full example code

Note: you can run (out-of-the-box) from the healthcareai-py folder:

from healthcareai import DevelopSupervisedModel
import pandas as pd
import time

def main():

    t0 = time.time()

    # CSV snippet for reading data into dataframe
    df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                    na_values=['None'])

    # SQL snippet for reading data into dataframe
    import pyodbc
    cnxn = pyodbc.connect("""SERVER=localhost;
                            DRIVER={SQL Server Native Client 11.0};
                            Trusted_Connection=yes;
                            autocommit=True""")

    df = pd.read_sql(
        sql="""SELECT *
            FROM [SAM].[dbo].[HCPyDiabetesClinical]
            -- In this step, just grab rows that have a target
            WHERE ThirtyDayReadmitFLG is not null""",
        con=cnxn)

    # Set None string to be None type
    df.replace(['None'],[None],inplace=True)

    # Look at data that's been pulled in
    print(df.head())
    print(df.dtypes)

    # Drop columns that won't help machine learning
    df.drop(['PatientID','InTestWindowFLG'],axis=1,inplace=True)

    # Step 1: compare two models
    o = DevelopSupervisedModel(modeltype='classification',
                            df=df,
                            predictedcol='ThirtyDayReadmitFLG',
                            graincol='PatientEncounterID', #OPTIONAL
                            impute=True,
                            debug=False)

    # Run the linear model
    o.linear(cores=1)

    # Run the random forest model
    o.random_forest(cores=1,
                    tune=True)

    # Look at the RF feature importance rankings
    o.plot_rffeature_importance(save=False)

    # Create ROC plot to compare the two models
    o.plot_roc(debug=False,
            save=False)

    print('\nTime:\n', time.time() - t0)

if __name__ == "__main__":
    main()