This article is a snippet from my university project on smart city technologies. If you are interested in reading the full report, don't hesitate to contact me at "@parismollo" (GitHub, Twitter, or LinkedIn).
Disclaimer: The app is no longer at v0.4.1, and the content of this article may differ slightly from the latest version of the app. In addition, I am no longer maintaining or updating the project, so you may encounter bugs if you access it.
Try the code
Note: The application works on both mobile and desktop browsers.
To test the project, access https://envai.herokuapp.com/ and wait a few seconds while the application boots. The landing page has the information needed to test the app and a description of the project. To test the machine learning models, go to the "Try it out" section and load an image of your choice.
After you load an image, the app starts loading all four models, which may take a few seconds. Once the models are ready, a "success" message appears on your screen and the results are shown in the left sidebar menu.
Attention: If you are testing the app on mobile, the sidebar menu is closed by default; to open it, tap the black arrow at the top left.
The rise of smart cities will accelerate the development of new technologies and inspire the next generations to build solutions for the challenges discussed in the previous section. With that in mind, I decided to build a prototype illustrating a possible use case of machine learning for security and safety in a smart city, particularly fire hazard detection.
What problem does this project tackle?
The fundamental ingredient for improving security and safety in cities is information. The more accurate and timely the information, the greater the chances of intervening in, controlling, and preventing criminal activity, illegal activity, and hazardous situations in the city.
As cities expand rapidly, it becomes very difficult to collect enough information in time to monitor and supervise the entire city. The 21st century's answer has been cameras and sensors spread across the city. However, cameras have limits: as their numbers grow, it becomes computationally impractical to store all the footage each camera records, and humanly impossible to watch and analyze this tremendous volume of data in real time. Sensors, on the other hand, can trigger many false positives and often have a very limited range and field of view. Although current CCTV and sensor systems are very valuable and help local authorities battle crime, violence, and hazards, there is still room for improvement. Machine learning algorithms can optimize the surveillance system by preprocessing most of the footage and identifying what is most likely to be relevant for further analysis or intervention, significantly reducing the workload of surveillance authorities.
What I wanted to do
The initial concept of this project was to build a web application that could identify fire hazards in images and output a description of the situation portrayed (e.g. environment classification, object classification, pedestrian detection). An application of this project to a real-life scenario is explained in figure 4.1. The machine learning algorithm would preprocess all the footage from CCTV cameras; once it classifies a hazardous fire scenario with more than 50% confidence, it writes a full description of the situation and alerts the closest fire department or security authority.
This framework could substantially improve the efficiency of firefighting in smart cities: it reduces the time needed to detect hazardous situations and therefore speeds up the fire department's response. In theory, this architecture could work without human supervision or emergency calls from city dwellers, but that does not match the current state of technology and security protocols. In the near future, however, there is likely to be a significant transition after which human supervision becomes unnecessary and this type of software communicates directly with the fire department.
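The decision logic described above can be sketched as a simple threshold check. This is a hypothetical illustration: `classify_frame` and `notify_fire_department` are placeholder names, not functions from the actual project.

```python
# Hypothetical sketch of the alert pipeline described above.
# `classify_frame` and `notify_fire_department` are placeholders,
# not the project's actual code.

FIRE_CONFIDENCE_THRESHOLD = 0.5  # alert when fire probability exceeds 50%

def should_alert(fire_probability: float,
                 threshold: float = FIRE_CONFIDENCE_THRESHOLD) -> bool:
    """Return True when the model's fire probability crosses the threshold."""
    return fire_probability > threshold

def process_frame(frame, classify_frame, notify_fire_department) -> bool:
    """Run the classifier on one CCTV frame and alert the authority if needed."""
    fire_probability, description = classify_frame(frame)
    if should_alert(fire_probability):
        notify_fire_department(description)
        return True
    return False
```

In a real deployment the threshold would be tuned against the false-positive rate rather than fixed at 50%.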
What I did
In the currently published version of the project (Alpha v0.4.1), the app is deployed on a Heroku server where users can access and test the machine learning models' accuracy. The Alpha v0.4.1 app has two modes, "demo" and "How it works". In demo mode, the user can select a jpg or jpeg file of their choice and upload it to the server; the app then displays four model labels in the sidebar menu, each outputting a set of classification probabilities.
Note: Alpha v0.4.1 has the following models: CIFAR10 object classification, Fire detection, Pedestrian Detection, and Scene classification.
In the "How it works" mode, the app also provides an interactive display of the output of each layer of the convolutional neural network (i.e. its feature maps), plus its filters (see section "Convolutional neural networks").
Note: In the original version of the app (GitHub repo only, not available on Heroku), the user can further inspect the models' architectures in the "evaluation" mode, and can also evaluate the models' performance on a set of images pre-loaded in the app.
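The feature-map display works by reading the outputs of intermediate convolutional layers. A minimal Keras sketch of the idea, using a toy model rather than the app's actual models (an assumption for illustration), looks like this:

```python
import numpy as np
from tensorflow import keras

# Toy CNN standing in for one of the app's models (assumption: the app's
# models are plain Keras Sequential CNNs).
model = keras.Sequential([
    keras.layers.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(8, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])

# Build a second model that returns every convolutional layer's
# output, i.e. the feature maps shown in the app.
conv_outputs = [l.output for l in model.layers
                if isinstance(l, keras.layers.Conv2D)]
activation_model = keras.Model(inputs=model.input, outputs=conv_outputs)

image = np.random.rand(1, 32, 32, 3).astype("float32")
feature_maps = activation_model.predict(image, verbose=0)
# feature_maps[0] has shape (1, 30, 30, 8): one 30x30 map per filter.
```

Each entry of `feature_maps` can then be plotted channel by channel to visualize what each filter responds to.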
The app was initially built using TensorFlow v2.2.0. Unfortunately, the package size of this version of TensorFlow is around 530 MB, which exceeds Heroku's 500 MB "slug size" limit for deployment. Due to this limitation, I cloned the repository and rolled the clone back to TensorFlow v2.0.0, which is around 83 MB, allowing me to deploy that version to a Heroku server.
The models' performance varies by task, but overall they all reach over 75% accuracy on their test sets. However, there is still a lot to be improved, and this is far from the accuracy levels required for production. The dataset sizes were relatively small for some tasks and will need to be increased in future versions of the project, either by collecting more data or by applying random transformations to the training set (data augmentation) to improve the models' generalization.
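The "random changes on the train set" mentioned above refer to data augmentation. A minimal NumPy sketch of two common transformations (an illustration, not the project's actual augmentation code):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip(image: np.ndarray) -> np.ndarray:
    """Flip the image horizontally with 50% probability."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image

def random_brightness(image: np.ndarray, max_delta: float = 0.2) -> np.ndarray:
    """Shift pixel intensities by a random amount, clipped to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)

def augment(image: np.ndarray) -> np.ndarray:
    """Produce a randomly transformed copy of the image for training."""
    return random_brightness(random_flip(image))

image = rng.random((32, 32, 3))       # toy image, values in [0, 1]
augmented = augment(image)            # same shape, randomly transformed
```

In Keras, the same effect is usually achieved with built-in preprocessing layers or an image data generator, applied on the fly during training.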
How it works
The project can be split into two components: Models Research & Development, and Web app development.
To build a sustainable and scalable project, I followed the step-by-step organization recommended by Aurélien Géron in his book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" and by Jason Brownlee's tutorials.
Step one consisted of defining my objectives and framing the problem my models would tackle (see section "What I wanted to do"). Once I defined the models' tasks (e.g. fire detection), I started searching for available datasets. The fire, pedestrian, and scene datasets are public datasets available on Kaggle, while for object classification I used the CIFAR10 dataset, available directly through Keras.
After deciding which datasets to use, I started development. On Kaggle or Google Colab (for CIFAR10), I wrote a Python script that would:
1. Load the data.
2. Preprocess the data (normalization or standardization).
3. Discover and visualize the dataset (shape and instances).
4. Create a machine learning model (Sequential model, optimizer, compile step).
5. Train the model on the train set and validate it on a validation set.
6. Evaluate the model (overfitting, underfitting, generalization, etc.).
7. Fine-tune the model (hyperparameters).
These steps can be carried out simultaneously for multiple architectures, to evaluate which one works best. Once I found the best-performing model, I saved the trained model in the HDF5 (.h5) format used by TensorFlow, to be loaded later by the web app.
There is a lot of relevant detail behind the steps mentioned above, such as the different techniques used for normalization, the models' architectures and hyperparameters, optimization algorithms, techniques to reduce overfitting, data augmentation, and more. Unfortunately, this report will not cover all these topics, but the notebook code is commented with brief explanations of the techniques I decided to use in each model.
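The steps above can be condensed into a short Keras sketch. The architecture below is a generic small CNN trained on synthetic stand-in data; the real runs used the Kaggle and CIFAR10 datasets and larger models, so treat this as an outline of the workflow, not the project's code.

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in for a real dataset (e.g. CIFAR10): 32x32 RGB images.
x = np.random.rand(64, 32, 32, 3).astype("float32")  # already in [0, 1] (normalization step)
y = np.random.randint(0, 10, size=(64,))             # integer class labels

# A small CNN: convolution + pooling, then a dense classifier head.
model = keras.Sequential([
    keras.layers.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train with a held-out validation split, then save for the web app.
history = model.fit(x, y, epochs=1, validation_split=0.2, verbose=0)
model.save("model.h5")  # HDF5 file loaded later by the Streamlit app
```

Evaluation and fine-tuning (steps 6 and 7) would then compare `history.history` curves across architectures and hyperparameter settings.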
The priority of this project is the development of the machine learning models. However, I believe it is important to provide a friendly interface where people can understand how the technology works, test the code, and hopefully give feedback. For that reason I used Streamlit as the framework for the web app. The web app lets users interact with the machine learning models, their architectures, and their layers.
After saving the trained machine learning models created in the R&D phase, I load them in the web app and use them to run predictions on the user's input (via Streamlit); once the models output their predictions, I display them on the interface.
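The load-and-predict flow just described can be sketched as a small helper; the function and class names here are illustrative, not the app's actual code. In the app, `model` would come from `keras.models.load_model("fire.h5")` (hypothetical filename).

```python
import numpy as np

def predict_with_model(model, image: np.ndarray, class_names: list) -> dict:
    """Run one image through a loaded Keras-style model and map each
    class name to its predicted probability."""
    batch = np.expand_dims(image.astype("float32"), axis=0)  # model expects a batch
    probabilities = model.predict(batch, verbose=0)[0]
    return dict(zip(class_names, probabilities.tolist()))
```

The returned dictionary is what the sidebar displays: one probability per label for each of the four models.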
So far I have presented the overall structure of the project and its features. This segment, however, will focus on explaining how the machine learning models used in this project work.
Note: This report doesn’t intend to cover details on Machine Learning, we will only be covering the necessary information for this project.
A quick overview of Machine Learning
"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E," wrote Tom Mitchell in 1997.
Tom Mitchell's statement describes the fundamental mechanism of machine learning programs. In principle, a machine learning algorithm is a computer program that learns from data (e.g. identifying patterns in images) to complete a specific predefined task (e.g. fire detection).
The ultimate objective of training a machine learning model is to find the parameters that optimize the model's performance, so that it generalizes adequately and performs well on new data.
There are many techniques within the machine learning field; one of the most popular is Artificial Neural Networks (ANNs), which are vaguely inspired by the biological neurons of the human brain.
Convolutional neural networks
Similarly to ANNs, convolutional neural networks emerged from the study of the human brain, except this time the technique tries to (vaguely) simulate neural patterns found in the brain's visual cortex.
Convolutional neural networks are a category of deep learning algorithms known for their performance on image recognition tasks. Thanks to recent increases in computational power, available data, and machine learning techniques, CNNs have achieved superhuman accuracy on some image recognition tasks.
Despite being a type of deep learning algorithm, CNNs differ structurally and functionally from fully connected deep networks. Inspired by David Hubel and Torsten Wiesel's groundbreaking studies of the visual cortex, CNN layers are not fully connected; following the mechanism of the "local receptive fields" found in visual cortex neurons, each layer of neurons reacts only to visual stimuli located in a limited region of the "visual field".
Some neurons react only to images of horizontal lines, while others react to vertical or diagonal lines. Some neurons react to complex shapes and have larger receptive fields. These discoveries showed that higher-level neurons build on the outputs of neighboring lower-level neurons.
In 1998, Yann LeCun introduced the famous LeNet-5 CNN architecture, known for its use in banks to recognize handwritten check numbers. This architecture reuses building blocks already present in traditional fully connected networks, such as dense layers and sigmoid activation functions, but it introduces two new building blocks: convolutional layers and pooling layers.
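For reference, a Keras approximation of the LeNet-5 layout looks like the following. The original 1998 model used sparse connectivity in C3 and an RBF output layer, so this is a simplified rendition, not an exact reproduction:

```python
from tensorflow import keras

# Approximation of LeNet-5 (LeCun et al., 1998) in Keras.
lenet5 = keras.Sequential([
    keras.layers.Input(shape=(32, 32, 1)),
    keras.layers.Conv2D(6, 5, activation="tanh"),    # C1: 6 feature maps, 28x28
    keras.layers.AveragePooling2D(),                 # S2: subsample to 14x14
    keras.layers.Conv2D(16, 5, activation="tanh"),   # C3: 16 feature maps, 10x10
    keras.layers.AveragePooling2D(),                 # S4: subsample to 5x5
    keras.layers.Flatten(),
    keras.layers.Dense(120, activation="tanh"),      # C5
    keras.layers.Dense(84, activation="tanh"),       # F6
    keras.layers.Dense(10, activation="softmax"),    # output: 10 digit classes
])
```

The alternation of convolution and pooling visible here is exactly the pattern described in the next paragraphs.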
The principal constituent of a convolutional neural network is the convolutional layer. Convolutional layers are responsible for applying "filters". Filters identify specific features through a matrix operation known as convolution. The result of the convolution is a feature map (i.e. the highlighted features of the input).
Pooling layers are simpler to understand: their role is to reduce the CNN's complexity, computational load, and overfitting by shrinking the input (i.e. an image or feature map).
The overall structure of a CNN can be represented as a sequence of convolutional layers, responsible for identifying features (e.g. lines, circles, shapes, etc.), and pooling layers, responsible for reducing the computational load of the network. Deeper layers are likely to identify complex shapes, while low-level layers identify simpler ones.
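The two operations can be illustrated with plain NumPy: a small vertical-edge filter convolved over an image, followed by 2x2 max pooling. This is a didactic sketch, not library code:

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2D convolution (really cross-correlation, as in CNN libraries)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling: keep the strongest activation per window."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Image with a vertical edge: left half dark, right half bright.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Vertical-edge filter: responds strongly at the dark-to-bright transition.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = convolve2d(image, kernel)  # 4x4 map, peaks along the edge
pooled = max_pool(feature_map)           # 2x2 summary of the strongest responses
```

The feature map is large exactly where the filter's pattern (a vertical edge) appears in the image, and the pooled map keeps that information in a quarter of the space.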
Note: More details on how Convolutional layers work will be presented on the project web app (e.g. feature maps, filters).
Challenges & Improvements
Due to the nature of the data (i.e. images) and the size of the datasets, it was impractical to train the models on a CPU; fortunately, Google Colab and Kaggle provide access to GPUs, which made training much faster.
The fire and pedestrian datasets used in version Alpha v0.4.1 are relatively small and need improvement. The fire model still produces a lot of false positives (e.g. on images with many yellowish or bright pixels), and the pedestrian model performs poorly on images with many components, such as cars, buildings, and multiple objects on the ground like traffic signs.
Another valuable improvement to the fire model would be to correctly classify hazardous fires and distinguish them from intentional fires such as campfires or burning candles. The current version of the project works only with images; a next step would be implementing video preprocessing and, potentially, an online training feature where users could give feedback directly in the app and label more data for the model.
Other improvements for next versions would be:
- Prediction Speed
- Transfer Learning
- Models Performance
- CIFAR100 implementation (to increase the number of objects the model can classify)
- API access
Computer vision algorithms such as CNNs are on a promising path toward superhuman accuracy. With the growing number of available datasets and new machine learning techniques, it is likely that these kinds of programs will become a baseline for the security frameworks of urban centers. However, the implementation of computer vision algorithms may also trigger another type of urban issue, this time related to privacy.
I hope this report motivates and inspires you to work towards new solutions that can bring equality, peace, safety, health, and food to city dwellers around the globe. The smart city concept can be tempting and make it seem that technology by itself will be humankind's savior; this is a misleading belief. Only with smart humans will unequal and poor cities transform into smart cities.
The city of the future awaits you.
— Paris Mollo