Hints and tips
Gathering the data
If you have interesting data in a CSV file, or even spread across several databases on a single server, you are in good shape. While it's easiest to pull data into the package from a single table, you can also use joins to gather data from separate tables or databases. What matters most is the following:
- You have a column you're excited about predicting and some data that might be relevant
- If you're predicting a binary outcome (i.e., 0 or 1), you have to convert the column to Y or N.
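The Y/N conversion mentioned above can be done before the data reaches the package. Here is a minimal sketch using only the Python standard library. The column name `Outcome`, the sample rows, and the helper `to_yes_no` are hypothetical and used purely for illustration; they are not part of the package.

```python
# Hypothetical rows, shaped like the output of csv.DictReader.
# "Outcome" is an assumed column name, not one the package requires.
rows = [
    {"ID": "1", "Outcome": "1"},
    {"ID": "2", "Outcome": "0"},
    {"ID": "3", "Outcome": "1"},
]


def to_yes_no(value):
    """Map 1/0 (as int or str) to 'Y'/'N'; leave any other value unchanged."""
    mapping = {"1": "Y", "0": "N"}
    return mapping.get(str(value).strip(), value)


for row in rows:
    row["Outcome"] = to_yes_no(row["Outcome"])
```

Values other than 0 or 1 (such as blanks) are passed through untouched, so missing outcomes stay visible rather than being silently recoded.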
Pre-processing
It's almost always helpful to do some feature engineering before creating a model. Here are some practical examples of that:
- If you think the thing you're predicting might have a seasonal pattern, you could convert a date-time column into columns representing DayOfWeek, DayOfMonth, WeekOfYear, etc.
- If you have rows with both a latitude and a longitude, it may be beneficial to add a derived location column, such as a zip code.
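As an illustration of the date-time example above, the following sketch derives DayOfWeek, DayOfMonth, and WeekOfYear from a timestamp string using Python's standard `datetime` module. The function name `date_features`, the timestamp format, and the choice of ISO 8601 week numbering are assumptions made for this example.

```python
from datetime import datetime


def date_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into calendar features.

    The format string is an assumption; adjust it to match your data.
    """
    dt = datetime.strptime(timestamp, fmt)
    return {
        "DayOfWeek": dt.isoweekday(),       # 1 = Monday ... 7 = Sunday
        "DayOfMonth": dt.day,               # 1 ... 31
        "WeekOfYear": dt.isocalendar()[1],  # ISO 8601 week number
    }


features = date_features("2024-03-15 08:30:00")
```

Each derived value can then be added to the row as its own column, which gives the model a chance to pick up weekly or yearly patterns that a raw timestamp hides.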
Model building tips
- Start small. You can often get a good idea of model performance by starting with 10k rows instead of 1M.
- Don't throw out rows with missing values. We'll help you experiment with imputation, which may improve the model's performance.
- Focus on new features. Rather than finding more rows of the same columns, finding better columns (i.e., features) will usually give better results.
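The first two tips can be tried out with a small sketch: draw a random subset of rows, then fill in missing values instead of discarding them. The helpers `sample_rows` and `impute_mean` and the generated data are hypothetical, and mean imputation is just one simple strategy chosen for illustration; it is not necessarily what the package does.

```python
import random
from statistics import mean


def sample_rows(rows, n, seed=0):
    """Return up to n rows chosen uniformly at random, without replacement."""
    if n >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, n)


def impute_mean(rows, column):
    """Return copies of rows with None in `column` replaced by the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        # Nothing to base the mean on; return the rows unchanged.
        return [dict(r) for r in rows]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical data: 100,000 rows where every tenth value of "x" is missing.
rows = [{"id": i, "x": None if i % 10 == 0 else float(i % 7)} for i in range(100_000)]

subset = sample_rows(rows, 10_000)   # start small
completed = impute_mean(subset, "x")  # keep rows with missing values
```

Fixing the seed keeps the subset reproducible, so that changes in model performance reflect changes to the features rather than a different sample.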