Extracting Knowledge from Data-Rich Environments and the Power of MLCircuit

MLCircuit is web-based software for developing and implementing machine learning (ML) applications, and ML models in particular. It was developed internally for use by the InterImage ML/AI/RPA teams.

First Some Important Background

What are Machine Learning Models vs Statistical Models vs Data Modeling?
Machine learning models find patterns in data and use those patterns to make predictions. Machine learning models (aka ML models) are sometimes developed using statistical modeling (e.g. regression) and at other times using other mathematical and algorithmic methods (e.g. decision trees and deep learning).

Terms like statistical models and data models can confuse the topic of machine learning. Both are different from machine learning models and are defined as follows.

  • Statistical models are based on a subfield of mathematics and used to approximate trends by finding relationships between variables.
  • Data models are used in software development to define the transfer, relation, and storage of data in an information technology solution.

Why Machine Learning?
Organizations are sitting on a gold mine of knowledge but don’t know how to mine it. Business programs could make better decisions with a deeper understanding of their data, but that understanding requires working with highly complex data at scale. Analyzing complex, high-volume data traditionally (without machine learning) requires significant human resources, and even then the deeper knowledge it contains often remains inaccessible. Machine learning tackles this problem by using mathematical and algorithmic methods to find patterns in data, which can then be used to improve decision-making with fewer human resources.

Machine Learning Steps

  1. Identify requirement/problem/question.
  2. Create and configure the development and data environments.
  3. Select and upload data.
  4. Train algorithms to find patterns in data.
  5. Review the patterns found to evaluate their validity and their usefulness for solving a specific problem.
  6. Host the pattern recognition solution (i.e. ML model) in order to integrate it into existing business processes or infrastructure to take actions and make decisions based on the patterns found.
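
Steps 3 through 5 can be illustrated with a minimal sketch. It is not MLCircuit code: the data values are made up, the function names are hypothetical, and ordinary least squares regression stands in for whatever algorithm a real project would train.

```python
from statistics import mean

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_absolute_error(model, xs, ys):
    """Average absolute gap between predictions and actual values."""
    slope, intercept = model
    return mean(abs((slope * x + intercept) - y) for x, y in zip(xs, ys))

# Step 3: select data (made-up values for illustration only).
data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0),
        (5, 10.8), (6, 13.1), (7, 15.0), (8, 16.9)]
train, holdout = data[:6], data[6:]

# Step 4: train the algorithm to find the pattern in the training rows.
model = fit_line([x for x, _ in train], [y for _, y in train])

# Step 5: review the pattern on rows the model has not seen.
error = mean_absolute_error(model,
                            [x for x, _ in holdout],
                            [y for _, y in holdout])
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} holdout MAE={error:.3f}")
```

Keeping the last two rows out of training is what makes step 5 meaningful: the error is measured on data the pattern was not fitted to.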

Challenges with Machine Learning and Developing Models

  • Data is at the core of developing ML models. When the data is limited or contains errors, ML models can still be developed, but the patterns they find are not reliable or usable.
  • Speed of model development is a challenge, as many activities involve programming line by line at a command prompt and rerunning code multiple times with small changes.

Using MLCircuit, we build ML models orders of magnitude faster than is possible with typical ML model development methods (e.g. using a GUI with automated code generation instead of coding at a command-line prompt). This means we can build more ML models in the same time period, and we can make significant modifications to those models in minutes. The more ML models we develop, the easier it is to identify which ones are better at finding patterns in data, and which of those patterns most effectively bring to light an organization’s own hidden knowledge.

How Does MLCircuit Build ML Models Faster Than Traditional Methods?

  • MLCircuit speeds up the development process by automating many of the Machine Learning steps including:
    • Environment setup
    • User management/security
    • Data source connections (e.g. SQL, CSV, etc.)
    • Code generation for training algorithms
    • Model storage with version control
    • Generation of visualizations and data for evaluating ML models
    • API generation for integrating ML models with business workflows
  • MLCircuit is a web-based application in which users build, evaluate, and deploy ML models through a browser-based GUI
  • MLCircuit is cloud-based and cloud service provider (CSP) agnostic

MLCircuit in Detail

MLCircuit is an internally developed tool whose primary goal is to enable our team to rapidly build, evaluate, and deploy machine learning models. The main philosophy behind the tool is speed to production: it should take significantly less time to build, evaluate, deploy, and continually improve machine learning models. Training staff to properly build and maintain machine learning models is very expensive because so many moving parts are involved.

The first step in any machine learning model development process always involves retrieving data from identified sources. In many cases, all the data needed to solve a particular problem lives in many different places and is generated by many different systems. One of the goals of MLCircuit was to reduce the time it takes to retrieve data from many disparate, non-standardized sources and store this data in an integrated and useful form.

Internally, we found ourselves constantly reusing the same code to pull from various data sources, but we never had a mechanism for maintaining configurations so that downstream tasks relying on this data would have reproducible results. To combat this issue, MLCircuit comes with connectors that allow us, for example, to upload a CSV file, run custom queries against a SQL Server, and make generalized API calls that retrieve records directly from other systems. MLCircuit exposes these connectors through a web interface that is accessible to our team.
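
The idea of keeping connector configurations alongside the connectors can be sketched as follows. This is an illustration under stated assumptions, not MLCircuit's implementation: the function names and configuration keys are hypothetical, and an in-memory SQLite database stands in for a SQL Server.

```python
import csv
import io
import json
import sqlite3

def run_csv_connector(config, csv_text):
    """Parse CSV text into row dicts using the stored delimiter."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=config["delimiter"])
    return list(reader)

def run_sql_connector(config, connection):
    """Execute the stored query and return rows keyed by column name."""
    cursor = connection.execute(config["query"])
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Saved configurations: rerunning a connector with the same stored
# settings is what makes downstream results reproducible.
configs_json = json.dumps({
    "customers_csv": {"type": "csv", "delimiter": ","},
    "orders_sql": {"type": "sql",
                   "query": "SELECT id, total FROM orders ORDER BY id"},
})
configs = json.loads(configs_json)

# Stand-in database with made-up rows (assumption: SQLite instead of SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 19.5), (2, 42.0)])

customers = run_csv_connector(configs["customers_csv"], "id,name\n1,Ada\n2,Grace\n")
orders = run_sql_connector(configs["orders_sql"], conn)
print(customers)
print(orders)
```

Because the query and delimiter live in the stored configuration rather than in ad hoc scripts, any later task can pull the same data the same way.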

The function of data standardization and storage is critically important. Without standardized and efficient ways to access data, it is impossible to do any machine learning. As part of the data integration process, MLCircuit automatically ensures proper data format, infers datatypes, and stores data in a manner that scales well as the amount of data increases.
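
Datatype inference of the kind described above can be sketched in a few lines. The rules below are an assumption made for illustration (try integer, then float, then ISO date, and fall back to string); the source does not describe the rules MLCircuit actually applies.

```python
import csv
import io
from datetime import date

def infer_type(values):
    """Return the narrowest type name that parses every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    candidates = (("int", int), ("float", float), ("date", date.fromisoformat))
    for type_name, parser in candidates:
        try:
            for value in non_empty:
                parser(value)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "string"

def infer_schema(csv_text):
    """Map each CSV column name to its inferred type name."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}

# Made-up sample with a missing value in the last column.
sample = "id,price,signup,city\n1,9.99,2021-03-04,Austin\n2,12,2021-04-01,\n"
schema = infer_schema(sample)
print(schema)
```

Empty cells are ignored during inference, so a single missing value does not force a numeric column to be stored as text.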

In recent tests, we found that MLCircuit could stream over 1 gigabyte of data to its model training engine in less than 20 seconds with low processor overhead. This is possible because, under the hood, MLCircuit uses a data table specification that is not tied to any specific programming language, so the platform can both stream and store large amounts of data rapidly and efficiently. Furthermore, our experts can use this component of MLCircuit even when the platform cannot fulfill all of the immediate model development requirements. This data standardization capability makes MLCircuit incredibly valuable: our developers no longer have to create a new data storage and standardization plan for every problem they are trying to solve, as was previously the norm.

Often, a machine learning developer will run hundreds of iterations on a model before they are satisfied that they have squeezed all possible performance out of it. This is incredibly time-consuming and often requires various implementations of data processing algorithms and models. When a developer tries to implement a completely different model, they may have to start from scratch with their data cleansing steps. MLCircuit makes it easy to try various combinations of models and saves all hyperparameters and evaluation metrics of these models so that results can be compared and the best model is not missed or forgotten. We reduce months of work to days and years to weeks.

To do this, MLCircuit provides our team with a drag-and-drop interface and a sophisticated library of modules that can be dropped and linked together on a canvas that resembles a circuit board (hence “MLCircuit”). Beneath this appearance, the circuit is a node-based computational graph of data flow transformations. This framework enables us to keep adding more advanced modules as we encounter more common business use cases. It also allows our team to test specific nodes without performing a full training run, which saves massive amounts of time, especially for circuits that use an ensemble of models to accomplish a task.
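
Testing a specific node without a full training run follows naturally from the graph structure. The sketch below is hypothetical (the `Circuit` class and node names are invented for illustration, and it has no cycle detection): evaluating a node runs only that node and the nodes upstream of it.

```python
class Circuit:
    """A tiny data-flow graph: each node is a function of its upstream nodes."""

    def __init__(self):
        self.nodes = {}   # node name -> (function, list of input node names)
        self.calls = []   # records which nodes actually executed

    def add(self, name, func, inputs=()):
        self.nodes[name] = (func, list(inputs))

    def run(self, target, cache=None):
        """Evaluate `target`, executing only the nodes it depends on."""
        cache = {} if cache is None else cache
        if target not in cache:
            func, inputs = self.nodes[target]
            args = [self.run(name, cache) for name in inputs]
            self.calls.append(target)
            cache[target] = func(*args)
        return cache[target]

circuit = Circuit()
circuit.add("load", lambda: [4.0, 8.0, None, 6.0])  # made-up data
circuit.add("clean", lambda rows: [r for r in rows if r is not None], ["load"])
circuit.add("scale", lambda rows: [r / max(rows) for r in rows], ["clean"])
circuit.add("train", lambda rows: {"mean": sum(rows) / len(rows)}, ["scale"])

# Test only the cleaning node: "scale" and "train" never execute.
cleaned = circuit.run("clean")
print(cleaned, circuit.calls)
```

In this sketch the `train` node is a trivial stand-in; in a real circuit it would be the expensive step that node-level testing avoids.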

Our specification supports our plans to distribute the computation graph across multiple nodes in our cloud environment, which would enable the platform to scale to true big data workloads. Additionally, “ML Templates” within the platform allow our team to deploy common models and machine learning tasks with just a few clicks. One such template is the Auto Classifier, which automatically preprocesses the data and searches over various classifiers and hyperparameters to find the best model for the data in any given situation. To our experienced ML developers, this is so easy that it almost feels wrong.
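
The search an Auto Classifier performs can be sketched as a loop over candidate models and hyperparameters. Everything here is an assumption made for illustration: the two toy classifiers, the one-dimensional made-up data, and the function names are not taken from MLCircuit.

```python
from collections import Counter

def majority_fit(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def knn_fit(train, k):
    """k-nearest neighbours on a single numeric feature."""
    def predict(x):
        nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]
    return predict

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

def auto_classify(train, valid, search_space):
    """Try every model/hyperparameter combination and keep all results."""
    results = []
    for name, fit, grid in search_space:
        for params in grid:
            model = fit(train, **params)
            results.append({"model": name, "params": params,
                            "accuracy": accuracy(model, valid)})
    best = max(results, key=lambda r: r["accuracy"])
    return best, results

train = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.2, "low"),
         (6.0, "high"), (6.5, "high"), (7.0, "high")]
valid = [(1.2, "low"), (6.8, "high"), (2.1, "low"), (5.9, "high")]
search_space = [
    ("majority", majority_fit, [{}]),
    ("knn", knn_fit, [{"k": 1}, {"k": 3}, {"k": 7}]),
]
best, results = auto_classify(train, valid, search_space)
print(best)
```

Returning every result, not only the winner, keeps the hyperparameters and metrics of each run available for later comparison.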

Even when a model is trained and performing well, it must be deployed before it provides any value. Deployment can take a significant amount of time, especially when the model has many components that each need to be saved and deployed in a robustly integrated manner. Key parts of model deployment that are often overlooked include schema validation, security, scalability, and versioning. MLCircuit handles all of these automatically: one click deploys the model we are most satisfied with. Instead of requiring an entire operations team to deploy and maintain models in production, MLCircuit tracks the deployed versions of every circuit and allows our team to quickly upgrade the models whenever we choose.
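
Two of the deployment concerns named above, schema validation and versioning, can be illustrated with a minimal sketch. The `ModelRegistry` class is hypothetical and the "models" are trivial stand-in rules; security and scalability are out of scope here, and nothing below reflects MLCircuit's actual deployment API.

```python
def validate(payload, schema):
    """Reject payloads with missing fields or wrong field types."""
    missing = sorted(set(schema) - set(payload))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field, expected in schema.items():
        if not isinstance(payload[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")

class ModelRegistry:
    """Keeps every deployed version and validates each request."""

    def __init__(self):
        self.versions = []  # (schema, predict function); version = index + 1

    def deploy(self, schema, predict):
        self.versions.append((schema, predict))
        return len(self.versions)

    def predict(self, payload, version=None):
        """Serve the requested version, or the latest when none is given."""
        version = len(self.versions) if version is None else version
        if not 1 <= version <= len(self.versions):
            raise KeyError(f"unknown version: {version}")
        schema, predict = self.versions[version - 1]
        validate(payload, schema)
        return {"version": version, "prediction": predict(payload)}

registry = ModelRegistry()
# Stand-in "models": simple rules in place of trained models.
v1 = registry.deploy({"age": int}, lambda p: p["age"] > 40)
v2 = registry.deploy({"age": int, "income": float},
                     lambda p: p["age"] > 40 and p["income"] > 50000.0)

latest = registry.predict({"age": 52, "income": 61000.0})
previous = registry.predict({"age": 30}, version=1)
print(latest, previous)
```

Because earlier versions stay in the registry, upgrading a model does not remove the ability to serve or compare against the previous one.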