SOLID Machine Learning

The SOLID principles are Object Oriented Programming (OOP) principles designed for writing Clean Code. According to Robert C. Martin’s book “Clean Code”, keeping a good software development pace requires a team to write Clean Code. With ugly code, you can surely come up with decent results quickly, but in the long run, your project will end up crashing into a wall.

It is known that software dies when it’s no longer possible to understand it or add new features to it in a reasonable amount of time. Keeping things clean is crucial to successfully carrying out any software project. This line of thought applies especially to machine learning projects that involve taking solutions from research or academic projects and implementing them in a production environment.

Research code is known to be dirty. Why? Because it’s often directed towards one goal only: producing proofs of concept (POCs) as quickly as possible. In this case, there’s not much incentive to keep one’s code clean. But if you plan on building a maintainable and scalable product, dirty code will not last long.


“SOLID”, you said?

Here are the SOLID principles, along with the Tell Don't Ask (TDA) principle, which is also relevant.

  • S: Single Responsibility Principle (SRP).
    Every class has one responsibility only.
  • O: Open-Closed Principle (OCP).
    Classes are closed to modification and open to extension.
  • L: Liskov Substitution Principle (LSP).
    Subtypes should be properly interchangeable with the base type.
  • I: Interface Segregation Principle (ISP).
    Distribute several interfaces across your objects rather than implementing a single big one.
  • D: Dependency Inversion Principle (DIP).
    Depend on abstractions instead of concrete implementations.
  • Also: Tell Don’t Ask (TDA).
    Don’t manage an object from the outside; tell it what to do instead.

Let’s dive deeper into these concepts.


S: Single Responsibility Principle (SRP)

This principle states that every class has one responsibility only. For instance, if you’re tempted to name a class SomethingAndSomethingElse, it’s a sign that the class should probably be split into two.

Separating classes to give them one responsibility each will ease things when the time comes to change the code. Not only will it be easier to implement changes, but the different parts will be more reusable and malleable. It’s an easy and safe way to code successive prototypes. It also makes testing smoother by letting you test one thing at a time, so when an error pops up, you can easily pinpoint the problem.

In addition, when several developers work together on the same project, respecting the SRP reduces the chances of two developers stepping on each other’s toes (modifying the same object at the same time). Imagine that the SomethingAndSomethingElse class has the responsibility of doing two different tasks (the task Something and the task SomethingElse) and that these tasks need to be refactored. If two developers take care of one task each, they will find themselves modifying the same object at the same time, which will probably cause a conflict in the code when the time comes to merge the two solutions into the project. On the other hand, if SomethingAndSomethingElse is split into two classes in accordance with the SRP, the two developers can work on two different objects, thus avoiding the conflict. Handling merge conflicts is generally time-consuming and very prone to bugs. This should be avoided as much as possible.

Let’s apply the SRP to a machine learning context. Suppose we have a class named MyNormalizerAndMyModel. This design is most likely too rigid, too specific and not malleable. Here’s what could (and should) be done. First, a class named MyModel would be created. If you think it’s part of MyModel’s job to normalize data, a method named “normalize” would be implemented inside of it. If not, another class named MyNormalizer (which would inherit from the same superclass as MyModel) should be implemented, and this class would implement the “normalize” method. Another class named MyPredictor could then be implemented, which would ingest data and one or more models to make predictions. MyPredictor would therefore represent what is called a Pipeline. It could encapsulate objects of type MyNormalizer, MyModel, etc. Other models could be implemented and used in the pipeline by simply swapping MyModel with another class representing another model. This way, it is easy to reuse objects in different contexts without having to duplicate code or modify an existing model (see also the Open-Closed Principle - OCP).
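To make the split concrete, here is a minimal sketch in plain Python. The class names (MyNormalizer, MyModel, MyPredictor) are the hypothetical ones from the paragraph above, and the shared superclass and its methods are assumptions made for illustration, not any library’s actual API.

    from abc import ABC, abstractmethod


    class PipelineStep(ABC):
        """Common superclass: every step transforms data and does nothing else."""

        @abstractmethod
        def transform(self, data):
            ...


    class MyNormalizer(PipelineStep):
        """Single responsibility: normalizing data."""

        def transform(self, data):
            mean = sum(data) / len(data)
            return [x - mean for x in data]


    class MyModel(PipelineStep):
        """Single responsibility: making predictions from already-prepared data."""

        def transform(self, data):
            return [x * 2.0 for x in data]  # placeholder for a real model


    class MyPredictor:
        """A tiny pipeline: chains steps without knowing their concrete types."""

        def __init__(self, steps):
            self.steps = steps

        def predict(self, data):
            for step in self.steps:
                data = step.transform(data)
            return data


    predictor = MyPredictor([MyNormalizer(), MyModel()])
    print(predictor.predict([1.0, 2.0, 3.0]))

Swapping MyModel for another PipelineStep subclass changes the pipeline's behavior without touching the normalizer or the predictor.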

Using a framework like Neuraxle (which is open source) greatly facilitates the construction of pipelines.

This principle also applies to data loading. It’s better to give the responsibility of loading data to a class separate from our machine learning algorithm. You would want to pass already loaded and crunched data to a pipeline made of model instances, so that the source of the data and the way to load it (there can be many) would not matter to the pipeline. This is also closely related to the Dependency Inversion Principle - DIP.

Another example of properly respecting the SRP in machine learning pipelines is to let each step of a pipeline define its own way to initialize itself, to save itself, and to delete itself. This is especially important when implementing deep learning, for instance. In that case, some of the pipeline steps might use GPUs while others use CPUs (such resource management is handled by Neuraxle’s pipelines). It’s therefore important for each step to be able to manage itself. Managing the specifics of each step of a pipeline from outside the pipeline would break not only the SRP but also the Tell Don’t Ask (TDA) principle.
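As a rough illustration of that idea, here is a sketch of two self-managing steps. The method names (setup, teardown, save) and the resource handling are assumptions for illustration only, not the API of any specific library.

    class GPUModelStep:
        """A step that owns its own GPU resources."""

        def setup(self):
            # Acquire the resources this step needs; no other step has to know how.
            self.device = "cuda:0"      # hypothetical device handle
            self.model = object()       # placeholder for a real deep learning model

        def teardown(self):
            # Release only what this step allocated.
            self.model = None
            self.device = None

        def save(self, path: str):
            # The step knows how to serialize itself (e.g. weights that can't be pickled).
            ...


    class CPUPreprocessingStep:
        """A lighter step that manages its own (CPU-only) state."""

        def setup(self):
            self.buffer = []

        def teardown(self):
            self.buffer = None

        def save(self, path: str):
            ...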

Now, suppose that you’re trying to find the best hyperparameters for the different models of a pipeline (this process is called Automated Machine Learning - AutoML). In such a case, it is crucial for each object inside a pipeline to be able to manage its own loading, saving, initialization and deletion in order to effectively allocate and deallocate resources. It would be a nightmare to manage that process outside the model or, worse, outside the pipeline. A major shortcoming of scikit-learn is its inability to manage this process. It’s also important to respect the SRP and OCP when serializing and saving machine learning pipelines containing code that can’t be saved natively by the Python interpreter (e.g.: here is how Neuraxle solves the issue of using GPU resources, making use of proper abstractions that respect the SRP and OCP).

O: Open-Closed Principle (OCP)

This principle states that a class should be closed to modification and open to extension. This means that if new functionality has to be added to a project, you shouldn’t have to go back to existing classes and edit things that already work.

The most obvious example of not respecting this principle is to include long if/else statement chains (switch cases) in a single piece of code. Having too many if/else branches encapsulated in such a blob is called “poor man’s polymorphism”. Polymorphism is the ability for an object to inherit from others and take other forms in order to provide different functionalities. Rather than coding a lengthy if/else block at a particular point in a pipeline, it’s advised to implement an abstract class that is open to extension (through polymorphism, in this case). All the different cases supposedly handled by an if/else block can then be embodied by as many different sub-classes as needed, all inheriting from the previously built abstract class. Such a sub-class can then override certain methods of its parent abstract super-class as you see fit. By doing that, there’s no need to modify the abstract super-class or any of its sub-classes when implementing new functionalities. You would just implement a new sub-class inheriting from the same super-class, and inject it where you see fit (see the section about the DIP for more information about injection).

Let’s illustrate that principle with the example of loading data as a step of a machine learning pipeline. It’s certainly not a good idea to simply implement a function called “load_data” which would have a huge if/else block to cover all the possible cases like fetching data from a local repository, an external repository, an SQL table, a CSV file, an S3 bucket, etc. It would be better to implement an abstract class called DataLoader (or DataRepository). It would then be possible to implement as many concrete subclasses as needed to handle the different ways of loading data. When the time comes to add a data loading step in a pipeline, the right sub-class (inheriting from DataLoader) suiting the context could be picked, so the pipeline doesn’t go through a useless chain of if/else statements to decide how to load data.
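A minimal sketch of that abstraction, using hypothetical class names (DataLoader, CSVDataLoader, SQLDataLoader) rather than an existing library’s API:

    from abc import ABC, abstractmethod
    import csv


    class DataLoader(ABC):
        @abstractmethod
        def load(self):
            """Return the data, wherever it is stored."""


    class CSVDataLoader(DataLoader):
        def __init__(self, path):
            self.path = path

        def load(self):
            with open(self.path, newline="") as f:
                return list(csv.reader(f))


    class SQLDataLoader(DataLoader):
        def __init__(self, connection, query):
            self.connection = connection
            self.query = query

        def load(self):
            return self.connection.execute(self.query).fetchall()


    # The pipeline receives whichever concrete loader suits the context,
    # with no if/else chain deciding how to load the data.
    def run_pipeline(loader: DataLoader):
        data = loader.load()
        ...

Adding a new data source (say, an S3 bucket) means adding a new subclass, not editing existing code.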

Here is another example to take this concept a step further. In compliance with the OCP, it is possible to implement an effective strategy for choosing the right hyperparameters in a machine learning pipeline. Imagine that at a certain stage of a pipeline, you have to choose between preprocessing method A and preprocessing method B. The most tempting strategy (without considering the OCP) would undoubtedly be to implement a full pipeline using method A and another full pipeline using method B. You would then switch from one pipeline to another directly in the main file using a for loop. At first glance, this sounds like a good idea, and it is the approach that most programmers and/or researchers in data science would recommend. The problem is that as new methods and models are implemented and tested, the main file must be modified to include a bigger and bigger switch. This approach will therefore most likely generate a poor man’s polymorphism directly in the main file. The main file is actually the worst place to implement this kind of polymorphism, because it is outside the source project and its code is not encapsulated in objects. In a machine learning project, the main files created here and there (outside of the source code) as the project progresses are very rarely maintained and updated, so code implemented in main files is often lost from one context to another. A better strategy in this case would be to create a class (which would be a pipeline step), in the source code of course, whose job would be to select a model from a list of models and then to optimize itself by choosing the best model according to a certain metric in an AutoML loop. This class would be closed to modification and open to extension, and any pipeline implemented in a main file could use it. This approach also respects the Interface Segregation Principle (ISP).

The Neuraxle library has tools to implement this kind of strategy. The code defining such a pipeline to optimize reads like this:

CODE: https://gist.github.com/jeromebedard12/30d17fb53fe2f1106d44fd430dbdc17e.js


According to the OCP, each step of a machine learning pipeline should be responsible not only for defining its own hyperparameters, but also its own hyperparameter space. If it is a wrapper, it must also be responsible for managing the space of the objects it wraps. For example, in the code above, the ChooseOneStepOf step not only changes the data flow, but also chooses the object to use (and thus the hyperparameter subspace to use). This process would be impossible if the OCP were not respected and the hyperparameters were hardcoded in the object, in which case manual modifications would have to be made continuously to explore different hyperparameters.
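To illustrate the idea without reproducing Neuraxle’s actual API, here is a plain-Python sketch of a ChooseOneStepOf-like wrapper where each candidate step owns its own hyperparameter subspace. All class and method names here are illustrative assumptions.

    import random


    class PreprocessingA:
        def get_hyperparams_space(self):
            return {"scale": [0.1, 1.0, 10.0]}


    class PreprocessingB:
        def get_hyperparams_space(self):
            return {"n_components": [2, 5, 10]}


    class ChooseOneOf:
        """A wrapper step: the choice of wrapped step is itself a hyperparameter."""

        def __init__(self, steps):
            self.steps = steps

        def get_hyperparams_space(self):
            space = {"chosen_step": list(range(len(self.steps)))}
            for i, step in enumerate(self.steps):
                # Each candidate contributes its own subspace; nothing is hardcoded here.
                for name, values in step.get_hyperparams_space().items():
                    space[f"step_{i}__{name}"] = values
            return space


    chooser = ChooseOneOf([PreprocessingA(), PreprocessingB()])
    space = chooser.get_hyperparams_space()
    # An AutoML loop could now sample from this combined space:
    sample = {name: random.choice(values) for name, values in space.items()}
    print(sample)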

L: Liskov Substitution Principle (LSP)

Subtypes should be properly interchangeable with the base type.

Suppose, for example, that you have different data loaders. According to the LSP, they should all obey the same interface: if they inherit from an interface, they should respect the interface’s planned behavior without surprises. Such interfaces are also called contracts in the OOP world. “Design by contract” (also known as programming by contract) is a principle directly linked to the LSP. In the case of class inheritance, it means that an object must respect the rules it inherits through polymorphism, and it should not betray the base object’s specifications.

The simplest example of breaking this principle is the rubber duck example. Let’s say you implement a class named RubberDuck (representing a rubber duck) and you make it inherit from the class Animal (representing a living animal). According to the LSP, this question arises: is RubberDuck a proper substitute for Animal? That is, can a RubberDuck instance “behave” like an Animal instance without breaking any of Animal’s rules? The answer is no. For instance, a call like RubberDuck.eat(food) is most likely to be broken. Either the rubber duck can’t eat food and has to fake being an animal, or it actually can eat, in which case it’s alive and the rubber duck concept is misused. Faking to be an animal requires extra mechanisms to handle the imitation, thus breaking the contract. Breaking the LSP in this way leads to dirty code that is also prone to bugs.
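Here is a short sketch of that violation, assuming a trivial Animal contract:

    class Animal:
        def eat(self, food):
            raise NotImplementedError


    class Duck(Animal):
        def eat(self, food):
            print(f"The duck eats {food}.")


    class RubberDuck(Animal):  # breaks LSP: it cannot honor the Animal contract
        def eat(self, food):
            raise TypeError("A rubber duck cannot eat.")  # surprises any caller


    def feed(animal: Animal):
        animal.eat("bread")    # written against the Animal contract


    feed(Duck())               # works as promised
    try:
        feed(RubberDuck())     # the subtype is not substitutable
    except TypeError as error:
        print(f"LSP broken: {error}")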

In the case of a machine learning pipeline, this principle means that a particular step should follow the same basic rules as any other step. A pipeline can contain sub-pipelines (nested pipelines) as well. Those nested pipelines should also follow the rules of a step; in fact, a pipeline should itself be considered a step. For instance, you should be able to replace a preprocessing method with another one without breaking the data flow (assuming a compatible data shape) or the pipeline’s behavior.

Let’s look back at the code example in the OCP section above. If LSP is respected, the step “YourPreprocessingA” could be replaced by the step “YourPreprocessingB” without any problem occurring. In the same way, “TrainOnlyWrapper(DataShuffler())” could also be replaced by “DataShuffler()” without breaking the data flow or expected behavior of the pipeline.


I: Interface Segregation Principle (ISP)

Do not implement a huge single interface, but rather distribute several interfaces across different objects.

Let's go back to the previous example of the rubber duck given in the LSP section. A rubber duck isn’t an animal and shouldn’t inherit from Animal. But still the RubberDuck class “wants” to do so, as it reuses some of the behavior defined in a Duck class for example (which would inherit from Animal). A solution here is to refactor the Duck class so it doesn’t inherit only from Animal, but also from a new WaterFloater base class. This way, both Duck and RubberDuck can inherit from WaterFloater. It’s therefore possible to reuse some of the Duck’s behavior in the RubberDuck, without breaking the LSP (without stating that the RubberDuck is an Animal or a Duck). Segregating interfaces this way allows for proper code reuse between various classes.

In the case of a machine learning pipeline, various interfaces must be implemented for various tasks. All the steps of a pipeline, as well as the pipeline itself, may inherit from the same abstraction (a BaseStep class, for instance) that makes step objects properly substitutable. However, some objects like TrainOnlyWrapper and Pipeline itself both wrap (nest) other step objects. According to the ISP, both TrainOnlyWrapper and Pipeline could inherit from a new class “on the side”. This way, BaseStep is not unnecessarily burdened by being forced to adopt a “nested” step behavior, especially since it’s not its responsibility to do so (see SRP above). With such recursive methods, it would be possible, for instance, to call “pipeline.get_hyperparams()” from outside the Pipeline and let the Pipeline dig into all its nested steps to collect the full tree of hyperparameters.
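The following plain-Python sketch shows this segregation. It is not Neuraxle’s actual class hierarchy, only an illustration in which the ability to nest steps lives in a mixin “on the side” of a small BaseStep:

    class BaseStep:
        """The minimal contract every step follows (see LSP above)."""

        def __init__(self, hyperparams=None):
            self.hyperparams = hyperparams or {}

        def get_hyperparams(self):
            return dict(self.hyperparams)


    class StepWrapperMixin:
        """Segregated interface: only steps that nest other steps inherit it."""

        def get_hyperparams(self):
            params = dict(self.hyperparams)
            for name, step in self.named_steps():
                # Recursively collect the nested steps' hyperparameters.
                for key, value in step.get_hyperparams().items():
                    params[f"{name}__{key}"] = value
            return params


    class Pipeline(StepWrapperMixin, BaseStep):
        def __init__(self, steps):
            super().__init__()
            self.steps = steps

        def named_steps(self):
            return [(type(step).__name__, step) for step in self.steps]


    pipeline = Pipeline([BaseStep({"learning_rate": 0.01})])
    print(pipeline.get_hyperparams())  # digs into all nested steps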

Separating interfaces facilitates code reuse by combining different behaviors. Proper respect of the ISP will make your codebase smaller, simpler, more modular and free of code duplication. Extract a common abstract class (interface) out of similar objects in order to allow those objects to be interchangeable in some situations - just as Duck and RubberDuck can be substituted for one another when it comes to floating on water. In the same way, a Pipeline and any other single-step wrapper (like TrainOnlyWrapper) have the ability to dig into their wrapped (nested) steps to set or retrieve info (like hyperparameters) or to serialize their steps. Both are interchangeable as objects having the ability to nest other objects.


D: Dependency Inversion Principle (DIP)

Depend on abstractions instead of concrete implementations.

This one is particularly important. Not respecting this principle can easily kill a machine learning project, or at least make it very hard to deploy new models to production.

Let’s say, for instance, that while building a prototype, the data is loaded at the same level of abstraction as where the model is saved. It would mean that in a single method (function), data is loaded, processed, passed into a model, some performance metrics are calculated and printed, and the model is saved to a particular repository (as in a dirty Jupyter notebook without any classes, which is often seen while prototyping and playing around in a research context).

Suppose you want to deploy such a prototype to production. It would imply connecting to a new data source and saving and/or loading the model and the performance metrics in a particular repository (which is different from the one used in experimentation). All that while keeping in mind that data sources and repositories for models and metrics are subject to changes and redesign in the future. Without dependency inversion, there’s a problem. If the problem is not handled in a proper way, a huge poor man’s polymorphism (see the OCP section) is most likely to appear everywhere in the code, and a tremendous amount of time will be wasted trying to keep the code alive and functional as you go through the production deployment process. The only viable long-term solution is to create abstract classes for data loading, pipeline steps, pipelines themselves and so forth. For instance, a data loading class specific to the initial prototype can be implemented. Then, when the time comes to connect to production data, another data loading class can be implemented and substituted for the previous one in the pipeline transparently. These two classes could inherit from an abstraction called DataLoader.

Also, the data loading class should not be instantiated at the same level of abstraction as where the model is implemented (that is, inside the model itself). The proper way to implement dependency inversion as per the DIP would be to pass an instance of a DataLoader to a model. Concrete objects must be passed in from a higher level rather than instantiated inside another class. Here is a code example where the data is prepared in advance as an iterable, and then sent into a pipeline.
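Since the original code example is not reproduced here, the following is a minimal sketch of the idea: the concrete loading logic lives at the top level, and only the already-prepared iterable reaches the pipeline. The file path and class names are hypothetical.

    from typing import Iterable, List


    class Pipeline:
        """A toy pipeline: it only ever receives data, never a data source."""

        def fit_transform(self, data: Iterable[float]) -> List[float]:
            return [x * 2.0 for x in data]  # placeholder for real steps


    def load_rows_from_csv(path: str) -> List[float]:
        # Concrete data loading lives here, outside the pipeline.
        with open(path) as f:
            return [float(line.strip()) for line in f if line.strip()]


    def main() -> None:
        data = load_rows_from_csv("data/train.csv")  # hypothetical path
        pipeline = Pipeline()
        print(pipeline.fit_transform(data))


    if __name__ == "__main__":
        main()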

Moreover, if caching is needed throughout the pipeline, special checkpoint steps should take care of this job independently from the regular steps (which also respects the SRP). The path to the caching location on disk should be managed by the pipeline itself via a context object. This object should have a function (get_path, for instance) returning the path. This path can then be passed to the inner (nested) steps of a pipeline as a root directory for them to save or load their cached data.
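As a rough sketch of this arrangement (illustrative names only, not a specific library’s API), the context object owns the cache location, and a checkpoint step only asks it where to write:

    import os
    import pickle


    class PipelineContext:
        def __init__(self, root_dir: str):
            self.root_dir = root_dir

        def get_path(self, step_name: str) -> str:
            # The pipeline's context decides where things go on disk.
            path = os.path.join(self.root_dir, step_name)
            os.makedirs(path, exist_ok=True)
            return path


    class Checkpoint:
        """A step whose only responsibility is caching intermediate data (SRP)."""

        def __init__(self, name: str):
            self.name = name

        def transform(self, data, context: PipelineContext):
            cache_file = os.path.join(context.get_path(self.name), "data.pkl")
            with open(cache_file, "wb") as f:
                pickle.dump(data, f)
            return data  # pass the data through unchanged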


Tell Don’t Ask (TDA)

This one is also very important to understand. The TDA principle is there to avoid ending up with leaky abstractions. An abstraction is leaky when you have to pull content out of the object to manipulate it. Having too many getters in an object is a good sign of not respecting TDA, although in some contexts it’s normal to have a lot of getters. A Data Transfer Object (DTO) is a good example: an object with little internal logic and a lot of getters and setters, simply used to store attributes in bulk.

Here is an example of non-compliance with TDA:

  1. Get some attributes from an object.
  2. Combine those attributes to update them or do something new.
  3. Set the result back in the object.

Here’s the right way to meet TDA:

  1. Simply tell your object what to do and pass the necessary extra stuff as arguments in the method (also notice DIP here).

So whenever a task needs to be repeated, instead of duplicating code to do the task outside the object, a single method implemented directly in the object is simply called. Duplicated code is one of the hardest things to manage, and it also breaks the OCP.
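Here is a minimal sketch of the contrast above, with a hypothetical running-average object:

    class RunningAverage:
        def __init__(self):
            self._count = 0
            self._total = 0.0

        # Tell Don't Ask: the object is told what to do and updates itself.
        def add(self, value: float) -> None:
            self._count += 1
            self._total += value

        def average(self) -> float:
            return self._total / self._count if self._count else 0.0


    # Breaking TDA would look like this, with the caller doing the object's job:
    #   total = avg.get_total()
    #   count = avg.get_count()
    #   avg.set_total(total + value)
    #   avg.set_count(count + 1)

    avg = RunningAverage()
    avg.add(3.0)
    avg.add(5.0)
    print(avg.average())  # 4.0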

Applied to machine learning pipelines, this means you want each step to take care of itself properly and independently, rather than having to micromanage it. You don’t want to be constantly digging into a pipeline to edit it. Instead, have your steps do the right thing at the right moment. For instance, let’s say a pipeline has to be set to test mode. With a simple line of code, you should be able to call a method that handles that task. This task would consist, for example, of disabling steps wrapped in a TrainOnlyWrapper and enabling steps in a TestOnlyWrapper. Each step should be responsible for updating itself following its own logic. You don’t want to dig into the pipeline’s objects to change a boolean (is_train, for example). Breaking the encapsulation of a pipeline would not only lead to dirty code, but would also be more prone to bugs. For instance, adding a second and third modification triggered in test mode would require digging through the whole pipeline three times if the TDA is not respected. The heavier the micromanagement, the higher the risk of bugs and the longer the programming time. It’s better to give the pipeline and pipeline steps the necessary internal logic to switch to test mode transparently.
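A sketch of that single call, with illustrative class names (the real TrainOnlyWrapper behaves similarly in spirit, but this is not its actual implementation):

    class Step:
        def set_train(self, is_train: bool) -> None:
            self.is_train = is_train


    class TrainOnlyWrapper(Step):
        def __init__(self, wrapped: Step):
            self.wrapped = wrapped
            self.enabled = True

        def set_train(self, is_train: bool) -> None:
            # The wrapper's own logic: it disables itself outside of training.
            self.enabled = is_train
            self.wrapped.set_train(is_train)


    class Pipeline(Step):
        def __init__(self, steps):
            self.steps = steps

        def set_train(self, is_train: bool) -> None:
            # Tell each step; every step applies its own logic recursively.
            for step in self.steps:
                step.set_train(is_train)


    pipeline = Pipeline([TrainOnlyWrapper(Step()), Step()])
    pipeline.set_train(False)  # one call switches the whole pipeline to test mode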

This article from Martin Fowler gives a deeper understanding of the TDA.

Conclusion

The SOLID principles of OOP (as well as TDA) have been detailed, and examples of their application to machine learning pipelines were presented. This article demonstrates the importance of coding a machine learning pipeline in a clean and structured way. The world of software development is in constant evolution, and so are the concepts presented here. The methods shared in this article are, for now and to the best of our knowledge, the right way to code clean machine learning pipelines in Python. At Umaneo, we have addressed these challenges by leveraging the open-source library Neuraxle.

