Big data, big data, big data everywhere. Since the explosion of smartphones and, later, IoT devices, billions of devices around the world are constantly generating data: our communications, our geographical position, our vital signs and health, traffic, weather sensors, and so on. All this enormous amount of data must be processed to be useful, and one of the most valuable things we can do with it is make predictions.
This is where Machine Learning comes into play, and Python has emerged as the undisputed leading programming language in this field. Python is not the fastest, the most powerful, or the most beautiful language, but it has something unique: the ratio between how easy it is to learn and what it can do. It is an easy-to-understand language, relatively simple to learn, with an army of libraries and modules that make everyday data processing tasks much easier. I think that's why the scientific community and people interested in Machine Learning have adopted it as the de facto language for data analysis.
1. Training our model
In this article I'm going to explain how to train a Python model to calculate the price of a house from its area in square meters using already known data and, most importantly, how to save this model so it can be used later in other programs. The important thing here is not how the model works, but how to export it so that we don't have to train it every time we want to make this calculation. In this example we use only a few samples, but training data can run to thousands or even millions of rows; training the model every time a prediction is needed would waste considerable time and energy.
Let's start by defining our dataset in a CSV file. This is the actual dataset we'll use to teach our model how to calculate prices:
```
area,price
60,150000
120,265000
150,300000
260,520000
300,625000
```
To train our model we will use the following libraries:
- Pandas: we’ll mainly use it to load our dataset
- Scikit-Learn: the most widely used Machine Learning library for Python
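If you don't already have these libraries, both can be installed from PyPI (this assumes pip is available on your system):

```shell
pip install pandas scikit-learn
```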
The following script train_area_model.py shows how to load our dataset into memory and use its data to train a linear regression model that calculates the price of a house based on its area.
```python
import pandas as pd
from sklearn import linear_model

# load the dataset into memory
training_dataset = pd.read_csv("area.csv")

# create a model using the linear regression algorithm
# and train it with the data from our csv
regression_model = linear_model.LinearRegression()

print("Training model...")
# model training
regression_model.fit(training_dataset[['area']], training_dataset.price)
print("Model trained.")

# ask the user to enter an area and calculate
# its price using our model
# (recent scikit-learn versions may warn that the input has no
# feature names; passing a DataFrame instead would silence it)
input_area = int(input("Enter area: "))
predicted_price = regression_model.predict([[input_area]])
print("Predicted price:", round(predicted_price[0], 2))
```

If we save it in the same directory as our dataset area.csv and run it, we can ask our model for the right price for a 105 square meter house:

```
$ python3 train_area_model.py
Training model...
Model trained.
Enter area: 105
Predicted price: 229569.05
```
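As a side note, with a single feature LinearRegression is just fitting a straight line, price ≈ coef_ × area + intercept_. Here is a quick sketch to peek at what the model learned; the dataset is defined inline so the snippet is self-contained, with the same numbers as area.csv:

```python
import pandas as pd
from sklearn import linear_model

# same data as area.csv, defined inline for a self-contained example
training_dataset = pd.DataFrame({
    "area": [60, 120, 150, 260, 300],
    "price": [150000, 265000, 300000, 520000, 625000],
})

regression_model = linear_model.LinearRegression()
regression_model.fit(training_dataset[["area"]], training_dataset.price)

# the fitted line: price ≈ slope * area + intercept
slope = regression_model.coef_[0]
intercept = regression_model.intercept_
print(f"price ≈ {slope:.2f} * area + {intercept:.2f}")
```

For this dataset the slope comes out to roughly 1951 price units per square meter, which is how 105 square meters maps to the 229569.05 we saw above.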
We now have a program that trains a machine learning model to predict house prices based on square footage.
Now let's say I want to use this model in other programs, or share it with colleagues so they can use it in their applications and benefit from the experience captured in my data. But we don't want to retrain the model every time a program runs, and we don't want to share our dataset. How do we do that?
2. Exporting our model to an external file
Next we'll see how to export our trained model so it can be used in other programs. The first step is to modify our train_area_model.py script so that it drops the user prompt and simply saves the model to a file. We'll use the pickle module from Python's standard library to serialize our model into a binary file. Let's see how the script changes:
```python
import pandas as pd
from sklearn import linear_model
import pickle

# load the dataset into memory
training_dataset = pd.read_csv("area.csv")

# create a model that uses the linear regression algorithm
# and train it with our dataset
regression_model = linear_model.LinearRegression()

# train the model
print("Training model...")
regression_model.fit(training_dataset[['area']], training_dataset.price)

# serialize our model and save it in the file area_model.pickle
print("Model trained. Saving model to area_model.pickle")
with open("area_model.pickle", "wb") as file:
    pickle.dump(regression_model, file)
print("Model saved.")
```
If we run our script again, the file area_model.pickle will be generated:
```
$ python3 train_area_model.py
Training model...
Model trained. Saving model to area_model.pickle
Model saved.
$ ls -l
total 24
-rw-r--r--  1 maf  staff   65  4 Apr 14:04 area.csv
-rw-r--r--  1 maf  staff  514  4 Apr 16:10 area_model.pickle
-rw-r--r--  1 maf  staff  657  4 Apr 14:27 train_area_model.py
```
The new file area_model.pickle is a binary representation of our trained model that we can load into any other script or program to use as we wish.
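As an aside, for models that carry large NumPy arrays the scikit-learn documentation suggests joblib, which offers the same dump/load idea as pickle but is more efficient for that case. A minimal sketch (joblib ships as a scikit-learn dependency; the model is trained inline here to keep the snippet self-contained):

```python
import pandas as pd
from joblib import dump, load
from sklearn import linear_model

# train the same model as train_area_model.py, with the data inline
training_dataset = pd.DataFrame({
    "area": [60, 120, 150, 260, 300],
    "price": [150000, 265000, 300000, 520000, 625000],
})
regression_model = linear_model.LinearRegression()
regression_model.fit(training_dataset[["area"]], training_dataset.price)

# joblib mirrors pickle's API: dump to a file, load it back
dump(regression_model, "area_model.joblib")
restored_model = load("area_model.joblib")
```

For a model this small the difference is negligible, so plain pickle is perfectly fine here.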
3. Importing our trained model into another program
If we want to use our trained model in other scripts or programs, we must load the file generated in the previous step. Let's see how to do this with a new script called predict_price.py:

```python
import pickle

# import our trained model
with open("area_model.pickle", "rb") as file:
    regression_model = pickle.load(file)

# ask the user to enter an area and calculate
# its price using the imported model
input_area = int(input("Enter area: "))
predicted_price = regression_model.predict([[input_area]])
print("Predicted price:", round(predicted_price[0], 2))
```

Done! Notice that it is no longer necessary to load the training dataset, import scikit-learn explicitly, or train any model, which makes our program much faster, lighter and more portable. (scikit-learn still has to be installed, since pickle imports it behind the scenes to rebuild the object.) We run our new script and voilà:

```
$ python3 predict_price.py
Enter area: 105
Predicted price: 229569.05
```
The best part is that the model no longer needs to be trained: it is ready for anyone to use. You could share it with your colleagues so they can make predictions based on your model and dataset, all without sharing the training data itself which, in many cases, could be private.
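A model loaded this way is also not limited to one value at a time: a single predict() call can price a whole batch of houses. A small sketch (the model is trained and round-tripped through pickle in memory here, standing in for the area_model.pickle file on disk):

```python
import pickle
import pandas as pd
from sklearn import linear_model

# train and round-trip the model through pickle in memory,
# standing in for area_model.pickle on disk
training_dataset = pd.DataFrame({
    "area": [60, 120, 150, 260, 300],
    "price": [150000, 265000, 300000, 520000, 625000],
})
model = linear_model.LinearRegression()
model.fit(training_dataset[["area"]], training_dataset.price)
model = pickle.loads(pickle.dumps(model))

# one predict() call prices a whole batch of houses
areas = pd.DataFrame({"area": [80, 105, 200]})
prices = model.predict(areas)
for area, price in zip(areas["area"], prices):
    print(f"{area} m2 -> {round(price, 2)}")
```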
This technique is applicable to any kind of algorithm supported by sklearn or other libraries, and even outside the scope of Machine Learning: pickle can serialize and save almost any Python object for later use in other programs and scripts. One caveat: unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.