<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://bendemra.ai/feed.xml" rel="self" type="application/atom+xml" /><link href="https://bendemra.ai/" rel="alternate" type="text/html" /><updated>2026-04-27T17:34:30+00:00</updated><id>https://bendemra.ai/feed.xml</id><title type="html">Hamza Bendemra</title><subtitle>GenAI &amp; Agentic AI</subtitle><entry><title type="html">Building an Employee Churn Model in Python to Develop a Strategic Retention Plan</title><link href="https://bendemra.ai/data-science/machine-learning/2019/03/11/building-employee-churn-model-python-strategic-retention-plan.html" rel="alternate" type="text/html" title="Building an Employee Churn Model in Python to Develop a Strategic Retention Plan" /><published>2019-03-11T06:00:00+00:00</published><updated>2019-03-11T06:00:00+00:00</updated><id>https://bendemra.ai/data-science/machine-learning/2019/03/11/building-employee-churn-model-python-strategic-retention-plan</id><content type="html" xml:base="https://bendemra.ai/data-science/machine-learning/2019/03/11/building-employee-churn-model-python-strategic-retention-plan.html"><![CDATA[<blockquote>
  <p><em>Originally published on <a href="https://medium.com/data-science/building-an-employee-churn-model-in-python-to-develop-a-strategic-retention-plan-57d5bd882c2d">Medium</a> on March 11, 2019</em></p>
</blockquote>

<p><img src="https://unsplash.com/photos/QBpZGqEMsKg/download?force=true&amp;w=800" alt="Employee working" />
<em>Everybody’s working hard but who is most likely to hand in their resignation letter? (Photo by Alex Kotliarskyi on Unsplash)</em></p>

<h1 id="contents">Contents</h1>

<ol>
  <li>Problem Definition</li>
  <li>Data Analysis</li>
  <li>EDA Concluding Remarks</li>
  <li>Pre-processing Pipeline</li>
  <li>Building Machine Learning Models</li>
  <li>Concluding Remarks</li>
</ol>

<h2 id="1-problem-definition">1. Problem Definition</h2>

<p>Employee turnover (also known as “employee churn”) is a costly problem for companies. The true cost of replacing an employee can often be quite large.</p>

<p>A study by the <a href="https://www.americanprogress.org/wp-content/uploads/2012/11/CostofTurnover.pdf">Center for American Progress</a> found that companies typically pay about one-fifth of an employee’s salary to replace that employee, and the cost can significantly increase if executives or highest-paid employees are to be replaced.</p>

<p>In other words, the cost of replacing employees remains significant for most employers. This is due to the time spent interviewing and finding a replacement, sign-on bonuses, and the loss of productivity for several months while the new employee gets accustomed to the new role.</p>

<p>Understanding why and when employees are most likely to leave can lead to actions to improve employee retention, as well as possibly planning new hiring in advance. I will follow a step-by-step systematic approach that could be applied to a variety of ML problems. This project would fall under what is commonly known as HR Analytics or People Analytics.</p>

<p><img src="https://unsplash.com/photos/n95VMLxqM2I/download?force=true&amp;w=800" alt="Business meeting" />
<em>Ready to make data-driven decisions! (Photo by rawpixel on Unsplash)</em></p>

<p>In this study, we will attempt to solve the following problem statement:</p>

<p>• What is the likelihood of an active employee leaving the company?<br />
• What are the key indicators of an employee leaving the company?<br />
• What strategies can be adopted based on the results to improve employee retention?</p>

<p>Given that we have data on former employees, this is a standard supervised classification problem where the label is a binary variable, 0 (active employee), 1 (former employee). In this study, our target variable Y is the probability of an employee leaving the company.</p>

<p><strong>N.B.</strong> For complete code, please refer to this <a href="https://github.com/hamzaben86/Employee-Churn-Predictive-Model">GitHub repo</a> and/or the <a href="https://www.kaggle.com/hamzaben/employee-churn-model-w-strategic-retention-plan">Kaggle Kernel</a>.</p>

<h2 id="2-data-analysis">2. Data Analysis</h2>

<p>In this case study, an HR dataset was sourced from <a href="https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/">IBM HR Analytics Employee Attrition &amp; Performance</a>, which contains records for 1,470 employees. I will use this dataset to predict when employees are going to quit by understanding the main drivers of employee churn.</p>

<p>As stated on the <a href="https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/">IBM website</a>: “This is a fictional data set created by IBM data scientists. Its main purpose was to demonstrate the IBM Watson Analytics tool for employee attrition.”</p>

<p><img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*yuiZeZzWBJPa1c7qLHRxhw.jpeg" alt="Data analysis" />
<em>Let’s crunch some employee data! (Photo by rawpixel on Unsplash)</em></p>

<h3 id="21-data-description-and-exploratory-visualisations">2.1 Data Description and Exploratory Visualisations</h3>

<p>First, we import the dataset and make a copy of the source file for this analysis. The dataset contains 1,470 rows and 35 columns.</p>

<p>The dataset contains several numerical and categorical columns providing various information on employees’ personal and employment details.</p>

<h3 id="22-data-source">2.2 Data source</h3>

<p>The data provided has no missing values. In HR Analytics, employee data is unlikely to feature a large ratio of missing values, as HR Departments typically have all personal and employment data on file.</p>

<p>However, the format in which the data is kept (i.e. whether it is paper-based, Excel spreadsheets, databases, etc.) has a massive impact on the accuracy of, and ease of access to, the HR data.</p>

<h3 id="23-numerical-features-overview">2.3 Numerical features overview</h3>

<p>A few observations can be made based on the information and histograms for numerical features:</p>

<p>• Several numerical features are tail-heavy; indeed several distributions are right-skewed (e.g. MonthlyIncome, DistanceFromHome, YearsAtCompany). Data transformation methods may be required to approach a normal distribution prior to fitting a model to the data.<br />
• Age distribution is a slightly right-skewed normal distribution with the bulk of the staff between 25 and 45 years old.<br />
• EmployeeCount and StandardHours are constant values for all employees. They’re likely to be redundant features.<br />
• Employee Number is likely to be a unique identifier for employees given the feature’s quasi-uniform distribution.</p>

<h3 id="24-feature-distribution-by-target-attribute">2.4 Feature distribution by target attribute</h3>

<p>In this section, a more detailed Exploratory Data Analysis is performed. For complete code, please refer to this <a href="https://github.com/hamzaben86/Employee-Churn-Predictive-Model">GitHub repo</a> and/or the <a href="https://www.kaggle.com/hamzaben/employee-churn-model-w-strategic-retention-plan">Kaggle Kernel</a>.</p>

<p><strong>2.4.1 Age</strong></p>

<p>The age distributions for Active and Ex-employees differ: the average age of ex-employees is 33.6 years old, compared to 37.6 years old for current employees.</p>

<p><strong>2.4.2 Gender</strong></p>

<p>The gender distribution shows a higher relative proportion of leavers among male employees than female employees: 17.0% of male employees in the dataset left, compared to 14.8% of female employees.</p>

<p><strong>2.4.3 Marital Status</strong></p>

<p>The dataset features three marital statuses: Married (673 employees), Single (470 employees), Divorced (327 employees). Single employees show the largest proportion of leavers at 25%.</p>

<p><strong>2.4.4 Role and Work Conditions</strong></p>

<p>A preliminary look at the relationship between Business Travel frequency and Attrition Status shows that the largest normalised proportion of leavers is found among employees who travel “frequently”. Travel metrics associated with Business Travel status were not disclosed (i.e. how many hours of travel is considered “Frequent”).</p>

<p>Several Job Roles are listed in the dataset: Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, Human Resources.</p>

<p><strong>2.4.5 Years at the Company and Since Last Promotion</strong></p>

<p>The average number of years at the company is 7.37 years for currently active employees and 5.13 years for ex-employees.</p>

<p><strong>2.4.6 Years with Current Manager</strong></p>

<p>The average number of years with the current manager is 4.37 years for currently active employees and 2.85 years for ex-employees.</p>

<p><strong>2.4.7 Overtime</strong></p>

<p>Some employees have overtime commitments. The data clearly show that a significantly larger proportion of employees with overtime commitments have left the company.</p>

<p><strong>2.4.8 Monthly Income</strong></p>

<p>Employee Monthly Income varies from $1,009 to $19,999.</p>

<p><strong>2.4.9 Target Variable: Attrition</strong></p>

<p>The feature “Attrition” is what this Machine Learning problem is about. We are trying to predict the value of the feature ‘Attrition’ by using other related features associated with the employee’s personal and professional history.</p>

<p>In the supplied dataset, the percentage of Current Employees is 83.9% and of Ex-employees is 16.1%. Hence, this is an imbalanced class problem.</p>

<p>Machine learning algorithms typically work best when the number of instances of each class is roughly equal. We will have to address this target feature imbalance prior to implementing our Machine Learning algorithms.</p>
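<p>As a minimal sketch of one common way to address this imbalance (class weighting; resampling methods such as SMOTE are an alternative), assuming a scikit-learn workflow:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.linear_model import LogisticRegression

# 'balanced' re-weights classes inversely to their frequency, so the
# minority class (leavers) is not drowned out during training
model = LogisticRegression(class_weight='balanced', solver='liblinear')
</code></pre></div></div>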

<h3 id="25-correlation">2.5 Correlation</h3>

<p>Let’s take a look at some of the most significant correlations. It is worth remembering that correlation coefficients only measure linear correlations.</p>
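<p>A quick sketch of how such a correlation check might look, assuming a dataframe <code class="language-plaintext highlighter-rouge">df_hr</code> (a hypothetical name) whose categorical columns, including the target, have already been numerically encoded:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Pearson correlations of every feature with the encoded target
correlations = df_hr.corr()['Attrition'].sort_values(ascending=False)
print(correlations.head(10))  # strongest positive correlations
print(correlations.tail(10))  # strongest negative correlations
</code></pre></div></div>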

<p>As shown above, “Monthly Rate”, “Number of Companies Worked” and “Distance From Home” are positively correlated to Attrition; while “Total Working Years”, “Job Level”, and “Years In Current Role” are negatively correlated to Attrition.</p>

<h2 id="3-eda-concluding-remarks">3. EDA Concluding Remarks</h2>

<p>• The dataset does not feature any missing or erroneous data values, and all features are of the correct data type.<br />
• The strongest positive correlations with the target feature are: Performance Rating, Monthly Rate, Num Companies Worked, Distance From Home.<br />
• The strongest negative correlations with the target feature are: Total Working Years, Job Level, Years In Current Role, and Monthly Income.<br />
• The dataset is imbalanced with the majority of observations describing Currently Active Employees.<br />
• Single employees show the largest proportion of leavers, compared to Married and Divorced counterparts.<br />
• About 10% of leavers left when they reached their 2-year anniversary at the company.<br />
• People who live further away from their work show a higher proportion of leavers compared to their counterparts.<br />
• People who travel frequently show a higher proportion of leavers compared to their counterparts.<br />
• People who have to work overtime show a higher proportion of leavers compared to their counterparts.<br />
• Employees who have already worked at several companies previously (already “bounced” between workplaces) show a higher proportion of leavers compared to their counterparts.</p>

<h2 id="4-pre-processing-pipeline">4. Pre-processing Pipeline</h2>

<p>In this section, we undertake data pre-processing steps to prepare the datasets for Machine Learning algorithm implementation. For complete code, please refer to this <a href="https://github.com/hamzaben86/Employee-Churn-Predictive-Model">GitHub repo</a> and/or the <a href="https://www.kaggle.com/hamzaben/employee-churn-model-w-strategic-retention-plan">Kaggle Kernel</a>.</p>

<h3 id="41-encoding">4.1 Encoding</h3>

<p>Machine Learning algorithms can typically only have numerical values as their predictor variables. Hence, encoding becomes necessary, as it maps categorical labels to numerical values. To avoid introducing an artificial ordering (and hence spurious feature importance) for categorical features with large numbers of unique values, we will use both Label Encoding and One-Hot Encoding, as shown below.</p>
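<p>As a minimal sketch of this step (the dataframe name <code class="language-plaintext highlighter-rouge">df_hr</code> and the exact column lists are assumptions; the full encoding logic is in the linked notebook):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Label-encode binary categorical columns
le = LabelEncoder()
for col in ['Attrition', 'Gender', 'OverTime']:
    df_hr[col] = le.fit_transform(df_hr[col])

# One-hot encode multi-valued categorical columns to avoid
# implying an artificial ordering between their values
df_hr = pd.get_dummies(df_hr, columns=['BusinessTravel', 'Department',
                                       'EducationField', 'JobRole',
                                       'MaritalStatus'])
</code></pre></div></div>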

<h3 id="42-feature-scaling">4.2 Feature Scaling</h3>

<p>Feature Scaling using MinMaxScaler rescales each feature to a specified range. Machine Learning algorithms perform better when input numerical variables fall within a similar scale. In this case, we are scaling each feature to the range 0 to 5.</p>
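<p>A minimal sketch of this scaling step, reusing the encoded <code class="language-plaintext highlighter-rouge">df_hr</code> from the sketch above and leaving the target untouched:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.preprocessing import MinMaxScaler

# Rescale every feature column to the [0, 5] range used in this project;
# the target column 'Attrition' is deliberately excluded
feature_cols = [c for c in df_hr.columns if c != 'Attrition']
scaler = MinMaxScaler(feature_range=(0, 5))
df_hr[feature_cols] = scaler.fit_transform(df_hr[feature_cols])
</code></pre></div></div>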

<h3 id="43-splitting-data-into-training-and-testing-sets">4.3 Splitting data into training and testing sets</h3>

<p>Prior to applying any Machine Learning algorithms, we must split our master dataset into separate training and testing dataframes.</p>
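<p>A minimal sketch of the split, stratified on the target so that both sets preserve the roughly 84/16 class ratio (the 80/20 split ratio itself is an assumption):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import train_test_split

X = df_hr.drop('Attrition', axis=1)
y = df_hr['Attrition']

# Stratify on y so the class imbalance is consistent across both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)
</code></pre></div></div>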

<h2 id="5-building-machine-learning-models">5. Building Machine Learning Models</h2>

<h3 id="51-baseline-algorithms">5.1 Baseline Algorithms</h3>

<p>Let’s first use a range of baseline algorithms (using out-of-the-box hyper-parameters) before we move on to more sophisticated solutions. The algorithms considered in this section are: Logistic Regression, Random Forest, SVM, KNN, Decision Tree Classifier, Gaussian NB.</p>

<p>Let’s evaluate each model in turn and provide accuracy and standard deviation scores. For complete code, please refer to this <a href="https://github.com/hamzaben86/Employee-Churn-Predictive-Model">GitHub repo</a> and/or the <a href="https://www.kaggle.com/hamzaben/employee-churn-model-w-strategic-retention-plan">Kaggle Kernel</a>.</p>
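<p>A sketch of how such a comparison might be run with 10-fold cross-validation (out-of-the-box hyper-parameters; <code class="language-plaintext highlighter-rouge">X_train</code>/<code class="language-plaintext highlighter-rouge">y_train</code> are assumed from the split sketched above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

models = [
    ('LR', LogisticRegression(solver='liblinear')),
    ('RF', RandomForestClassifier(n_estimators=100)),
    ('SVM', SVC(gamma='auto')),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
]

# Evaluate each baseline with stratified 10-fold cross-validation
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models:
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='roc_auc')
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))
</code></pre></div></div>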

<p>Classification Accuracy is the number of correct predictions made as a ratio of all predictions made. It is the most common evaluation metric for classification problems.</p>

<p>However, it is often misused, as it is only really suitable when there are an equal number of observations in each class and all predictions and prediction errors are equally important. That is not the case in this project, so a different scoring metric may be more suitable.</p>

<p>Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems. The AUC represents a model’s ability to discriminate between positive and negative classes, and is better suited to this project. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.</p>

<p>Based on our ROC AUC comparison analysis, Logistic Regression and Random Forest show the highest mean AUC scores. We will shortlist these two algorithms for further analysis.</p>

<h3 id="52-logistic-regression">5.2 Logistic Regression</h3>

<p>GridSearchCV allows us to fine-tune hyper-parameters by searching over specified parameter values for an estimator. As shown below, GridSearchCV provided us with fine-tuned hyper-parameters, using ROC AUC as the scoring metric.</p>
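<p>A sketch of such a grid search (the parameter grid shown is illustrative, not the exact search space used in the linked notebook):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative grid: regularisation strength and penalty type
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'penalty': ['l1', 'l2']}

grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    param_grid, scoring='roc_auc', cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
</code></pre></div></div>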

<h3 id="53-confusion-matrix">5.3 Confusion Matrix</h3>

<p>The Confusion Matrix provides us with a much more detailed representation of the accuracy score and of what’s going on with our labels: we know exactly which labels were correctly and incorrectly predicted, and how. The accuracy of the Logistic Regression classifier on the test set is 75.54%.</p>

<h3 id="54-label-probability">5.4 Label Probability</h3>

<p>Instead of getting binary estimated target features (0 or 1), a probability can be associated with the predicted target. The output provides a first index referring to the probability that the data belong to class 0 (employee not leaving), and the second refers to the probability that the data belong to class 1 (employee leaving). Predicting probabilities of a particular label provides us with a measure of how likely an employee is to leave the company.</p>
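<p>For instance, with the fine-tuned estimator from the grid search sketched above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># predict_proba returns one row per employee:
# column 0 = P(class 0, stays), column 1 = P(class 1, leaves)
probabilities = grid.best_estimator_.predict_proba(X_test)
leave_probability = probabilities[:, 1]
</code></pre></div></div>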

<h3 id="55-random-forest-classifier">5.5 Random Forest Classifier</h3>

<p>Let’s take a closer look at using the Random Forest algorithm. I’ll fine-tune the Random Forest algorithm’s hyper-parameters by cross-validation against the AUC score.</p>

<p>Random Forest allows us to know which features are of the most importance in predicting the target feature (“Attrition” in this project). Below, we plot features by their importance.</p>
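<p>A sketch of how these importances can be extracted and plotted (the hyper-parameters are placeholders for the tuned values):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=7)
rf.fit(X_train, y_train)

# Rank features by impurity-based importance and plot the Top 10
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
</code></pre></div></div>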

<p>Random Forest helped us identify the Top 10 most important indicators (ranked in the table below) as: (1) MonthlyIncome, (2) OverTime, (3) Age, (4) MonthlyRate, (5) DistanceFromHome, (6) DailyRate, (7) TotalWorkingYears, (8) YearsAtCompany, (9) HourlyRate, (10) YearsWithCurrManager.</p>

<p>The accuracy of the Random Forest classifier on the test set is 86.14%. The corresponding Confusion Matrix is shown below.</p>

<p>Predicting probabilities of a particular label provides us with a measure of how likely an employee is to leave the company. The AUC when predicting probabilities using RandomForestClassifier is 0.818.</p>

<h3 id="56-roc-graphs">5.6 ROC Graphs</h3>

<p>The ROC curve is a performance measurement for classification problems at various threshold settings, and the AUC represents the degree of separability: it tells us how capable the model is of distinguishing between classes. The green line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).</p>
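<p>A sketch of how such a curve can be plotted from the predicted probabilities (reusing <code class="language-plaintext highlighter-rouge">leave_probability</code> from the sketch above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# ROC curve from the model's predicted leave probabilities
fpr, tpr, _ = roc_curve(y_test, leave_probability)
plt.plot(fpr, tpr, label='Logistic Regression (AUC = %.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'g--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
</code></pre></div></div>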

<p>As shown above, the fine-tuned Logistic Regression model showed a higher AUC score compared to the Random Forest Classifier.</p>

<h2 id="6-concluding-remarks">6. Concluding Remarks</h2>

<h3 id="61-risk-score">6.1 Risk Score</h3>

<p>As the company generates more data on its employees (on New Joiners and recent Leavers), the algorithm can be re-trained on the additional data and should, in theory, generate more accurate predictions, identifying employees at high risk of leaving based on the probabilistic label the algorithm assigns to each observation (i.e. each employee).</p>

<p>Employees can be assigned a “Risk Score” based on the predicted label, such that (a short sketch of this binning follows the list):</p>

<p>• Low-risk for employees with label &lt; 0.6<br />
• Medium-risk for employees with label between 0.6 and 0.8<br />
• High-risk for employees with label &gt; 0.8</p>
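<p>A minimal sketch of that binning, assuming the array of predicted leave probabilities from the earlier sketch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Bin each predicted leave probability into one of three risk bands
risk_bands = pd.cut(leave_probability,
                    bins=[0.0, 0.6, 0.8, 1.0],
                    labels=['Low-risk', 'Medium-risk', 'High-risk'],
                    include_lowest=True)
</code></pre></div></div>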

<h3 id="62-indicators-and-strategic-retention-plan">6.2 Indicators and Strategic Retention Plan</h3>

<p>The stronger indicators of people leaving include:</p>

<p>• <strong>Monthly Income</strong>: people on higher wages are less likely to leave the company. Hence, efforts should be made to gather information on industry benchmarks in the current local market to determine if the company is providing competitive wages.<br />
• <strong>Over Time</strong>: people who work overtime are more likely to leave the company. Hence, efforts must be made to appropriately scope projects upfront with adequate support and manpower so as to reduce the use of overtime.<br />
• <strong>Age</strong>: Employees in the relatively young age bracket of 25–35 are more likely to leave. Hence, efforts should be made to clearly articulate the long-term vision of the company and how young employees fit in that vision, as well as to provide incentives in the form of clear paths to promotion, for instance.<br />
• <strong>DistanceFromHome</strong>: Employees who live further from home are more likely to leave the company. Hence, efforts should be made to provide support in the form of company transportation for clusters of employees living in the same area, or in the form of a Transportation Allowance. Initial screening of candidates based on their home location is not recommended, as it could be regarded as a form of discrimination so long as employees make it to work on time every day.<br />
• <strong>TotalWorkingYears</strong>: The more experienced employees are less likely to leave. Employees who have between 5–8 years of experience should be identified as potentially having a higher-risk of leaving.<br />
• <strong>YearsAtCompany</strong>: Loyal employees are less likely to leave. Employees who hit their two-year anniversary should be identified as potentially having a higher risk of leaving.<br />
• <strong>YearsWithCurrManager</strong>: A large number of leavers leave 6 months after their Current Managers. By using Line Manager details for each employee, one can determine which Managers have experienced the largest number of employee resignations over the past year.</p>

<p>Several metrics can be used here to determine whether action should be taken with a Line Manager:</p>

<p>• # of years the Line Manager has been in a particular position: this may indicate that the employees may need management training or be assigned a mentor (ideally an Executive) in the organisation<br />
• Patterns in the employees who have resigned: this may indicate recurring patterns in employees leaving in which case action may be taken accordingly.</p>

<h3 id="63-final-thoughts">6.3 Final Thoughts</h3>

<p>A strategic retention plan can be drawn up for each Risk Score group. In addition to the suggested steps for each feature listed above, face-to-face meetings between an HR representative and employees can be initiated for medium- and high-risk employees to discuss work conditions. Also, a meeting with those employees’ Line Managers would make it possible to discuss the work environment within the team and whether steps can be taken to improve it.</p>

<hr />

<p>I hope you enjoyed reading this article as much as I had writing it. Once again, for complete code, please refer to this <a href="https://github.com/hamzaben86/Employee-Churn-Predictive-Model">GitHub repo</a> and/or the <a href="https://www.kaggle.com/hamzaben/employee-churn-model-w-strategic-retention-plan">Kaggle Kernel</a>.</p>]]></content><author><name></name></author><category term="data-science" /><category term="machine-learning" /><category term="machine-learning" /><category term="hr" /><category term="data-science" /><category term="people-analytics" /><category term="python" /><category term="employee-churn" /><category term="predictive-modeling" /><summary type="html"><![CDATA[A comprehensive guide to building machine learning models for predicting employee churn, featuring exploratory data analysis, model comparison, and strategic retention recommendations.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://bendemra.ai/assets/images/posts/employee-churn-model.jpg" /><media:content medium="image" url="https://bendemra.ai/assets/images/posts/employee-churn-model.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Soft Skills Will Make or Break You as a Data Scientist</title><link href="https://bendemra.ai/data-science/career-advice/2019/01/12/soft-skills-will-make-or-break-you-as-a-data-scientist.html" rel="alternate" type="text/html" title="Soft Skills Will Make or Break You as a Data Scientist" /><published>2019-01-12T06:00:00+00:00</published><updated>2019-01-12T06:00:00+00:00</updated><id>https://bendemra.ai/data-science/career-advice/2019/01/12/soft-skills-will-make-or-break-you-as-a-data-scientist</id><content type="html" xml:base="https://bendemra.ai/data-science/career-advice/2019/01/12/soft-skills-will-make-or-break-you-as-a-data-scientist.html"><![CDATA[<blockquote>
  <p><em>Originally published on <a href="https://medium.com/data-science/soft-skills-will-make-or-break-you-as-a-data-scientist-7b9c8c47f9b">Medium</a> on January 12, 2019</em></p>
</blockquote>

<p><img src="https://unsplash.com/photos/zEIC764gb9w/download?force=true&amp;w=800" alt="Office meeting discussion" />
<em>“What do you mean you can’t give me a definitive yes or no answer?” (Photo by rawpixel on Unsplash)</em></p>

<p>As businesses gather an increasing amount of data related to various aspects of their organisation (e.g. internal business operations, customer purchases and behaviour), the demand for data-savvy employees has exploded over the last 5 years.</p>

<p>Business leaders have woken up to the fact that data-driven decision-making can lead to making better decisions (it is not the only factor of course, but that’s a discussion for another post). As a result, there is a strong demand for data analysts and data scientists across a wide range of industries.</p>

<p>Since HBR’s declaration that Data Scientist is the “Sexiest Job of the 21st Century” back in 2012, a plethora of online and university courses have flourished to allow interested students to learn the fundamentals of Data Science. However, there are several key aspects that deserve more attention than they get in the current discussion to ensure your long-term success as a Data Scientist. In this post, I will focus on two:</p>

<h2 id="1-there-is-no-such-thing-as-a-typical-data-scientist-experience">1. There is no such thing as a “typical” Data Scientist experience.</h2>

<p>Your journey and work experience as a Data Scientist will massively vary depending on the culture and data maturity level of the organisation you work at.</p>

<p><img src="https://unsplash.com/photos/V9sv7QrDUgc/download?force=true&amp;w=800" alt="Programming workspace" />
<em>Your data science journey will vary greatly depending on your organization’s maturity (Photo by Jefferson Santos on Unsplash)</em></p>

<p>You may spend the first few weeks or months mostly working as a Data Engineer setting up databases, and determining which data infrastructure would be most suitable for the organisation.</p>

<p>After this initial stage, you may mostly work as a Data Analyst, spending your days writing Python scripts and SQL queries to organise and clean the various datasets collected, and setting up pipelines to automate data collection, cleaning, and processing.</p>

<p>Only after the foundations are built will you start analysing structured and unstructured data to identify trends and patterns using statistical and ML models. Finally, you’ll be able to work as a Data Scientist and build various predictive, classification, and forecasting models.</p>

<p>Now, being able to get to this final stage can take a lot of negotiation and convincing, primarily to buy time to set things up before being able to provide valuable action-oriented insights. The truth is that all the hype surrounding AI is the best and worst thing that’s happened to it.</p>

<blockquote>
  <p>All the hype surrounding Machine/Deep Learning is the best and worst thing that’s happened to it.</p>
</blockquote>

<p>It has increased interest (and funding) in the field but it has also created unreasonable expectations when applied in a business context. It has reduced the work that we do as Expert Statisticians with Strong Programming Skills (isn’t that what a Data Scientist is?) to a string of buzzwords and headlines.</p>

<p>This has created an environment where businesses expect outcomes right away. This is particularly the case if they don’t have an existing data infrastructure as they would often not know the preliminary groundwork that it takes, and I don’t blame them: you don’t know what you don’t know — that’s why you hire a data-savvy employee. Which brings me to my second point.</p>

<h2 id="2-your-soft-skills-will-make-or-break-you-as-a-data-scientist">2. Your soft skills will make or break you as a Data Scientist</h2>

<p>Your ability to communicate the value and insights that can be derived from your work is key to your long-term success as a Data Scientist. You need to get C-level staff on board, as you need data-driven leadership. Your skills in presenting yourself and in articulating the value of your work will help you in hiring new employees and building your data-driven team.</p>

<p>Slowly, the results you are contributing to coupled with your ability to articulate the process and value of your work, will snowball and create interest in the rest of the organisation. Take this opportunity to up-skill current employees who are interested in learning more, and create allies of your Data Science team in other departments of the organisation.</p>

<p><img src="https://unsplash.com/photos/gMsnXqILjp4/download?force=true&amp;w=800" alt="Business presentation" />
<em>Effective communication is crucial for data science success (Photo by Campaign Creators on Unsplash)</em></p>

<p>Written and oral communication skills such as the ability to prepare progress reports, presentations, and interactive data dashboards to communicate findings and insights to key stakeholders will increase the actual and perceived value you provide to the business.</p>

<p>You must be able to translate your findings into business decisions by also providing brief non-technical background on the techniques you used and the biases and uncertainties inherent to the data science process.</p>

<p>Ultimately, in a business context, the best algorithm is not the one with the highest AUC score; it is the one that stakeholders understand and trust enough to use effectively.</p>

<p>Building these soft skills alongside strong technical skills is no trivial feat and this is why skilled data scientists are so hard to find and so much sought-after.</p>

<p>But becoming a full-stack data scientist with strong communication skills is something that can be developed over time and should not be dismissed as optional.</p>

<hr />

<p><em>If you found this article helpful, feel free to connect with me on <a href="https://linkedin.com/in/hamzabendemra" target="_blank" rel="noopener">LinkedIn</a> or check out my other articles on <a href="https://medium.com/@Hamza.b86" target="_blank" rel="noopener">Medium</a>.</em></p>]]></content><author><name></name></author><category term="data-science" /><category term="career-advice" /><category term="data-science" /><category term="communication" /><category term="soft-skills" /><category term="career" /><category term="machine-learning" /><category term="business" /><summary type="html"><![CDATA[Essential insights on why soft skills are crucial for data scientists, covering communication, stakeholder management, and building successful data-driven teams in business environments.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://bendemra.ai/assets/images/posts/soft-skills-data-scientist.jpg" /><media:content medium="image" url="https://bendemra.ai/assets/images/posts/soft-skills-data-scientist.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Using Unsupervised Learning to Plan a Vacation to Paris: Geo-location Clustering</title><link href="https://bendemra.ai/data-science/machine-learning/2018/05/22/using-unsupervised-learning-to-plan-a-paris-vacation-geo-location-clustering.html" rel="alternate" type="text/html" title="Using Unsupervised Learning to Plan a Vacation to Paris: Geo-location Clustering" /><published>2018-05-22T06:00:00+00:00</published><updated>2018-05-22T06:00:00+00:00</updated><id>https://bendemra.ai/data-science/machine-learning/2018/05/22/using-unsupervised-learning-to-plan-a-paris-vacation-geo-location-clustering</id><content type="html" xml:base="https://bendemra.ai/data-science/machine-learning/2018/05/22/using-unsupervised-learning-to-plan-a-paris-vacation-geo-location-clustering.html"><![CDATA[<blockquote>
  <p><em>Originally published on <a href="https://medium.com/data-science/using-unsupervised-learning-to-plan-a-paris-vacation-geo-location-clustering-d0337b4210de">Medium</a> on May 22, 2018</em></p>
</blockquote>

<p><img src="https://unsplash.com/photos/Q0-fOL2nqZc/download?force=true&amp;w=800" alt="Eiffel Tower Paris" />
<em>The Eiffel Tower in the City of Love but there are so many great sights in Paris — how can I help her organise her trip?</em></p>

<p>When my friend told me she was planning a 10-day trip to Paris, I figured I could help.</p>

<blockquote>
  <p><strong>Her</strong>: “My friend and I are thinking of going to Paris on vacation.”<br />
<strong>Me</strong>: “Oh sounds fun. Maybe I can join in as well.”<br />
<strong>Her</strong>: “Yes! We’re planning to enrol in a French language course, and do loads of shopping! We already listed all the shopping malls we want to visit, and …”<br />
<strong>Me</strong>: “Ah sounds great… so what time do you need me to drive you to the airport?”</p>
</blockquote>

<p>Since I’ve been to Paris a few times myself, I figured I could help in other ways like contributing to the list of sights and places to visit. After listing all those sights and attractions, I created a Google map with a pin for each location.</p>

<p><img src="https://unsplash.com/photos/dC6Pb2JdAqs/download?force=true&amp;w=800" alt="Google Maps with pins" />
<em>The initial Google Map with pins for sights to visit — but in what order should these sights be visited?</em></p>

<p>It became quite clear that some scheduling work would be needed to see all that Paris had to offer — but how can she decide what to see first, and in what order? This seemed like a <a href="https://towardsdatascience.com/tagged/clustering">clustering</a> problem to me, and an <a href="https://towardsdatascience.com/tagged/unsupervised-learning">unsupervised learning</a> method could help solve it.</p>

<p>Algorithms like <a href="http://scikit-learn.org/stable/modules/clustering.html#k-means">K-Means</a> or <a href="http://scikit-learn.org/stable/modules/clustering.html#dbscan">DBScan</a> would probably do the trick. But first, the data must be prepared so that such an algorithm can perform as intended.</p>

<h2 id="geolocations-collection-and-data-preparation">Geolocations Collection and Data Preparation</h2>

<p>First, I had to gather the Google map pins geo-locations in a 2D format (something that could be stored in a <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html">numpy array</a>). Translating those pins into [longitude, latitude] would be perfect.</p>

<p>Since this was a one-off, I looked for a quick way to extract this info from the Google map. This <a href="https://stackoverflow.com/questions/2558016/how-to-extract-the-lat-lng-of-pins-in-google-maps">StackOverflow query</a> gave me all that I needed.</p>

<p>Basically, one needs to go to google.com/maps/d/kml?mid={map_id} and download a *.kmz file. Then, manually change the extension of the *.kmz file to a *.zip file. Extract the file, and open doc.kml in a text editor of your choice (<a href="https://www.sublimetext.com/">SublimeText</a> is my personal go-to).</p>

<p>Then, you may decide to manually CTRL+F to search for <code class="language-plaintext highlighter-rouge">&lt;coordinates&gt;</code> fields, or decide to not be so lazy about it and use <a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> (as shown below)!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>

<span class="c1"># Parse the KML file
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'doc.kml'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="s">'xml'</span><span class="p">)</span>

<span class="c1"># Extract coordinates
</span><span class="n">coordinates</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">placemark</span> <span class="ow">in</span> <span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'Placemark'</span><span class="p">):</span>
    <span class="n">coords</span> <span class="o">=</span> <span class="n">placemark</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">'coordinates'</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">coords</span><span class="p">:</span>
        <span class="n">coordinates</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">coords</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">strip</span><span class="p">())</span>
</code></pre></div></div>

<p>Once I extracted the coordinates from the XML file, I stored the coordinates in a dataframe. The total number of landmarks is 26 (stored in <code class="language-plaintext highlighter-rouge">&lt;Placemark&gt;&lt;/Placemark&gt;</code> in the XML file).</p>

<p><img src="https://unsplash.com/photos/JKUTrJ4vK00/download?force=true&amp;w=800" alt="Data analysis visualization" />
<em>Dataframe populated with info from the XML file (first 7 rows only shown)</em></p>

<p>From the dataframe which stored coordinates and the name of the landmark/sight, I generated a scatter plot.</p>

<h2 id="k-means-clustering">K-Means Clustering</h2>

<p>In general, unsupervised learning methods are useful for datasets without labels AND when we do not necessarily know the outcome we are trying to predict. These algorithms typically take one of two forms: (1) clustering algorithms, or (2) dimensionality reduction algorithms.</p>

<p>In this section, we’ll focus on <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means</a>, which is a clustering algorithm. With k-means, a pre-determined number of clusters is provided as input and the algorithm generates the clusters within the un-labeled dataset.</p>

<p>k-means generates a set of k cluster centroids and a labeling of input array X that assigns each of the points in X to a unique cluster. The algorithm determines cluster centroids as the arithmetic mean of all the points belonging to the cluster, AND defines clusters such that each point in the dataset is closer to its own cluster center than to other cluster centers.</p>

<p>It can be implemented in Python using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">sklearn</a> as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># Prepare coordinates array
</span><span class="n">coordinates_array</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="n">lat</span><span class="p">,</span> <span class="n">lng</span><span class="p">]</span> <span class="k">for</span> <span class="n">lat</span><span class="p">,</span> <span class="n">lng</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'Latitude'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'Longitude'</span><span class="p">])])</span>

<span class="c1"># Apply K-Means clustering
</span><span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">cluster_labels</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">coordinates_array</span><span class="p">)</span>

<span class="c1"># Add cluster labels to dataframe
</span><span class="n">df</span><span class="p">[</span><span class="s">'Cluster'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cluster_labels</span>
</code></pre></div></div>

<p>As one can see in the scatter plot below, I generated 10 clusters — one for each vacation day. But the sights in the center of Paris are so close to each other, it is quite difficult to distinguish one cluster from the other. The resulting predictions were also sorted and stored in a dataframe.</p>

<p>Using the 10 clusters generated by k-means, I generated a dataframe that assigns a day of the week to each cluster. This would constitute an example of a schedule.</p>
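<p>A minimal sketch of that assignment (the column names are assumptions; cluster numbering is arbitrary, so a real itinerary might order the days by geography instead):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Map each of the 10 k-means cluster labels to a vacation day
day_labels = {cluster: 'Day %d' % (cluster + 1) for cluster in range(10)}
df['Day'] = df['Cluster'].map(day_labels)

# One row per sight, sorted into the daily schedule
df_schedule = df.sort_values('Day')
</code></pre></div></div>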

<p>Now, at this stage, I could have simply handed over the sorted dataframe, re-organised the Google map <a href="https://support.google.com/mymaps/answer/3024933?co=GENIE.Platform%3DDesktop&amp;hl=en">pins by layers</a> (i.e. each layer representing one day), and that’s it — itinerary complete.</p>

<p>But something was still bugging me, and that is that k-means was generating clusters based on the Euclidean distance between points — meaning the straight-line distance between two pins in the map.</p>

<p>But as we know, the Earth isn’t flat (<a href="http://theconversation.com/how-to-reason-with-flat-earthers-it-may-not-help-though-95160">right?</a>) so I was wondering if this approximation was affecting the clusters being generated, especially since we have quite a few landmarks far from the high-density region in the center of Paris.</p>

<h2 id="because-the-earth-isnt-flat-enter-hdbscan">Because the Earth isn’t flat: enter HDBSCAN</h2>

<p>Hence, we need a clustering method that can handle <a href="https://en.wikipedia.org/wiki/Geographical_distance">Geographical distances</a>, meaning lengths of the shortest curve between two points along the surface of the Earth.</p>

<p>A density function such as <a href="http://hdbscan.readthedocs.io/en/latest/index.html">HDBSCAN</a>, which is based on the <a href="https://en.wikipedia.org/wiki/DBSCAN">DBScan</a> algorithm, may be useful for this.</p>

<p>Both HDBSCAN and DBSCAN are density-based spatial clustering methods that group together points that are close to each other, based on a distance measurement and a minimum number of points. They also mark as outliers the points that lie in low-density regions.</p>

<p>Thankfully, HDBSCAN supports <a href="https://en.wikipedia.org/wiki/Haversine_formula">haversine distance</a> (i.e. longitude/latitude distances) which will properly compute distances between geo-locations. For more on HDBSCAN, check out this <a href="https://towardsdatascience.com/lightning-talk-clustering-with-hdbscan-d47b83d1b03a">blog post</a>.</p>

<p>HDBSCAN isn’t included in your typical Python distribution so you’ll have to pip or conda install it. I did so, and then ran the code below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">hdbscan</span>

<span class="c1"># Apply HDBSCAN clustering with haversine distance
</span><span class="n">clusterer</span> <span class="o">=</span> <span class="n">hdbscan</span><span class="p">.</span><span class="n">HDBSCAN</span><span class="p">(</span>
    <span class="n">min_cluster_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">metric</span><span class="o">=</span><span class="s">'haversine'</span>
<span class="p">)</span>

<span class="c1"># Convert coordinates to radians for haversine distance
</span><span class="n">coordinates_rad</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">radians</span><span class="p">(</span><span class="n">coordinates_array</span><span class="p">)</span>
<span class="n">cluster_labels</span> <span class="o">=</span> <span class="n">clusterer</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">coordinates_rad</span><span class="p">)</span>

<span class="c1"># Add cluster labels to dataframe
</span><span class="n">df</span><span class="p">[</span><span class="s">'HDBSCAN_Cluster'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cluster_labels</span>
</code></pre></div></div>

<p>I ended up with the following scatter plot and dataframe. We see that isolated points were placed in cluster ‘-1’, which means they were identified as ‘noise’.</p>

<p>Unsurprisingly, we end up with several points being flagged as noise. Since the minimum number of points for an HDBSCAN cluster is 2, isolated locations like the Palais de Versailles were categorised as noise. Sainte-Chapelle de Vincennes and Musée Rodin suffered a similar fate.</p>

<p>The interesting part, however, is the number of clusters that HDBSCAN identified: 9, one fewer than the set number of vacation days. I guess that for the number of sights/data points that we chose, 10 days sounds like it’ll be okay.</p>

<p>Ultimately, the results from k-means were the ones we used to lay out a schedule as the clusters generated by k-means were similar to the ones generated by HDBSCAN and all data points were included.</p>

<h2 id="conclusion-and-future-improvements">Conclusion and Future Improvements</h2>

<p>The clustering method presented here can of course be improved. One possible improvement is the addition of a weight feature for the data points. For instance, the weights could represent the amount of time needed to fully visit a particular venue (e.g. Le Louvre easily takes one full day to appreciate), which would affect the total number of data points in a cluster containing highly weighted points — something to investigate in future projects.</p>

<p><strong>Jupyter Notebook for this mini-project can be <a href="https://github.com/hamzaben86/Vacation-Clustering-MiniProject">found here</a>.</strong></p>

<hr />

<p><em>If you found this article helpful, feel free to connect with me on <a href="https://linkedin.com/in/hamzabendemra" target="_blank" rel="noopener">LinkedIn</a> or check out my other articles on <a href="https://medium.com/@hamzabendemra" target="_blank" rel="noopener">Medium</a>.</em></p>]]></content><author><name></name></author><category term="data-science" /><category term="machine-learning" /><category term="machine-learning" /><category term="clustering" /><category term="unsupervised-learning" /><category term="k-means" /><category term="hdbscan" /><category term="geolocation" /><category term="travel" /><category term="python" /><summary type="html"><![CDATA[A practical application of K-Means and HDBSCAN clustering algorithms to optimize travel itinerary planning using geo-location data from Paris landmarks and attractions.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://bendemra.ai/assets/images/posts/paris-clustering.jpg" /><media:content medium="image" url="https://bendemra.ai/assets/images/posts/paris-clustering.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Build Your First Deep Learning Classifier using TensorFlow: Dog Breed Example</title><link href="https://bendemra.ai/data-science/deep-learning/2018/04/26/build-your-first-deep-learning-classifier-using-tensorflow-dog-breed-example.html" rel="alternate" type="text/html" title="Build Your First Deep Learning Classifier using TensorFlow: Dog Breed Example" /><published>2018-04-26T06:00:00+00:00</published><updated>2018-04-26T06:00:00+00:00</updated><id>https://bendemra.ai/data-science/deep-learning/2018/04/26/build-your-first-deep-learning-classifier-using-tensorflow-dog-breed-example</id><content type="html" xml:base="https://bendemra.ai/data-science/deep-learning/2018/04/26/build-your-first-deep-learning-classifier-using-tensorflow-dog-breed-example.html"><![CDATA[<blockquote>
  <p><em>Originally published on <a href="https://medium.com/data-science/build-your-first-deep-learning-classifier-using-tensorflow-dog-breed-example-964ed0689430">Medium</a> on April 26, 2018</em></p>
</blockquote>

<p><img src="https://unsplash.com/photos/iar-afB0QQw/download?force=true&amp;w=800" alt="Neural Network Architecture" />
<em>Convolutional Neural Networks (like the one pictured above) are powerful tools for Image Classification</em></p>

<h2 id="introduction">Introduction</h2>

<p>In this article, I will present several techniques for you to make your first steps towards developing an algorithm that could be used for a classic image classification problem: detecting dog breed from an image.</p>

<p>By the end of this article, we’ll have developed code that will accept any user-supplied image as input and return an estimate of the dog’s breed. Also, if a human is detected, the algorithm will provide an estimate of the dog breed that is most resembling.</p>

<h1 id="1-what-are-convolutional-neural-networks">1. What are Convolutional Neural Networks?</h1>

<p>Convolutional neural networks (also referred to as CNN or ConvNet) are a class of deep neural networks that have seen widespread adoption in a number of computer vision and visual imagery applications.</p>

<p>A famous case of CNN application was detailed in this <a href="https://www.nature.com/articles/nature21056?error=cookies_not_supported&amp;code=c8ab8524-38e7-47c0-af7a-5912873b07b6">research paper</a> by a Stanford research team in which they demonstrated classification of skin lesions using a single CNN. The Neural Network was trained from images using only pixels and disease labels as inputs.</p>

<p>Convolutional Neural Networks consist of multiple layers designed to require relatively little pre-processing compared to other image classification algorithms.</p>

<p>They learn by using filters and applying them to the images. The algorithm takes a small square (or ‘window’) and starts applying it over the image. Each filter allows the CNN to identify certain patterns in the image. The CNN looks for parts of the image where a filter matches the contents of the image.</p>

<p><img src="https://unsplash.com/photos/Q1p7bh3SHj8/download?force=true&amp;w=800" alt="CNN Architecture" />
<em>An example of a CNN Layer Architecture for Image Classification</em></p>

<p>The first few layers of the network may detect simple features like lines, circles, edges. In each layer, the network is able to combine these findings and continually learn more complex concepts as we go deeper and deeper into the layers of the Neural Network.</p>

<h2 id="11-what-kinds-of-layers-are-there">1.1 What kinds of layers are there?</h2>

<p>The overall architecture of a CNN consists of an input layer, hidden layer(s), and an output layer. There are several types of layers, e.g. Convolutional, Activation, Pooling, Dropout, Dense, and SoftMax layers.</p>

<p><img src="https://unsplash.com/photos/FO7JIlwjOtU/download?force=true&amp;w=800" alt="Neural Network Layers" />
<em>Neural Networks consist of an input layer, hidden layers, and an output layer</em></p>

<p>The Convolutional Layer (or Conv layer) is at the core of what makes a Convolutional Neural Network. The Conv layer consists of a set of filters. Every filter can be considered as a small square (with a fixed width and height) which extends through the full depth of the input volume.</p>

<p>During each pass, the filter ‘convolves’ across the width and height of the input volume. This process results in a 2-dimensional activation map that gives the responses of that filter at every spatial position.</p>

<p>To avoid over-fitting, Pooling layers are used to apply non-linear downsampling on activation maps. In other words, Pooling Layers are aggressive at discarding information but can be useful if used appropriately. A Pooling layer would often follow one or two Conv Layers in CNN architecture.</p>

<p>Dropout Layers are also used to reduce over-fitting by randomly ignoring certain activations, while Dense Layers are fully connected layers that often come at the end of the Neural Network.</p>
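<p>To make these layer types concrete, here is a minimal sketch of a CNN stack in Keras (the filter counts and the number of output classes are illustrative, not the project’s actual architecture):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    # Conv layers learn filters; pooling downsamples their activation maps
    Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.3),   # randomly drops activations to reduce over-fitting
    Flatten(),
    Dense(133, activation='softmax')  # one output per class (count illustrative)
])
</code></pre></div></div>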

<h2 id="12-what-are-activation-functions">1.2 What are Activation Functions?</h2>

<p>The output of the layers and of the neural network are processed using an activation function, which is a node that is added to the hidden layers and to the output layer.</p>

<p>You’ll often find that the ReLu activation function is used in hidden layers, while the final layer typically consists of a SoftMax activation function. The idea is that by stacking layers of linear and non-linear functions, we can detect a large range of patterns and accurately predict a label for a given image.</p>

<p>SoftMax is often found in the final layer, where it basically acts as a normalizer and produces a discrete probability distribution vector. This is great for us, as the output we want from the CNN is the probability that an image corresponds to a particular class.</p>

<p>When it comes to model evaluation and performance assessment, a loss function is chosen. In CNNs for image classification, the <a href="https://aboveintelligent.com/deep-learning-basics-the-score-function-cross-entropy-d6cc20c9f972">categorical cross-entropy</a> is often chosen (in a nutshell: it corresponds to -log(error)). There are several methods to minimise the error using Gradient Descent — in this article, we’ll rely on “<a href="http://ruder.io/optimizing-gradient-descent/">rmsprop</a>”, which is an adaptive learning rate method, as an optimizer, with accuracy as a metric.</p>
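<p>Putting those choices together, compiling a Keras model with this loss/optimizer/metric combination looks like this (a sketch, reusing the illustrative <code class="language-plaintext highlighter-rouge">model</code> from the earlier sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Compile with the loss, optimizer, and metric discussed above
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
</code></pre></div></div>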

<h1 id="2-setting-up-the-algorithms-building-blocks">2. Setting up the algorithm’s building blocks</h1>

<p>To build our algorithm, we’ll be using <a href="https://www.tensorflow.org/">TensorFlow</a>, <a href="https://keras.io/">Keras</a> (neural networks API running on top of TensorFlow), and <a href="https://opencv.org/">OpenCV</a> (computer vision library).</p>

<p>Training and testing datasets were also available on-hand when completing this project (see <a href="https://github.com/udacity/dog-project">GitHub repo</a>).</p>

<h2 id="21-detecting-if-image-contains-a-human-face">2.1 Detecting if Image Contains a Human Face</h2>

<p>To detect whether the image supplied is a human face, we’ll use one of OpenCV’s <a href="https://docs.opencv.org/trunk/d7/d8b/tutorial_py_face_detection.html">Face Detection algorithm</a>. Before using any of the face detectors, it is standard procedure to convert the images to grayscale. Below, the <code class="language-plaintext highlighter-rouge">detectMultiScale</code> function executes the classifier stored in <code class="language-plaintext highlighter-rouge">face_cascade</code> and takes the grayscale image as a parameter.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">cv2</span>

<span class="c1"># Load the cascade
</span><span class="n">face_cascade</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CascadeClassifier</span><span class="p">(</span><span class="s">'haarcascade_frontalface_alt.xml'</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">face_detector</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
    <span class="n">img</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">imread</span><span class="p">(</span><span class="n">img_path</span><span class="p">)</span>
    <span class="n">gray</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">cvtColor</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">COLOR_BGR2GRAY</span><span class="p">)</span>
    <span class="n">faces</span> <span class="o">=</span> <span class="n">face_cascade</span><span class="p">.</span><span class="n">detectMultiScale</span><span class="p">(</span><span class="n">gray</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">faces</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span>
</code></pre></div></div>

<h2 id="22-detecting-if-image-contains-a-dog">2.2 Detecting if Image Contains a Dog</h2>

<p>To detect whether the image supplied contains a face of a dog, we’ll use a pre-trained <a href="http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006">ResNet-50</a> model using the <a href="https://en.wikipedia.org/wiki/ImageNet">ImageNet</a> dataset which can classify an object from one of <a href="https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a">1000 categories</a>. Given an image, this pre-trained <a href="http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006">ResNet-50 model</a> returns a prediction for the object that is contained in the image.</p>

<p>When using <a href="https://www.tensorflow.org/">TensorFlow</a> as backend, <a href="https://keras.io/">Keras</a> CNNs require a 4D array as input. The <code class="language-plaintext highlighter-rouge">path_to_tensor</code> function below takes a string-valued file path to a color image as input, resizes it to a square image that is 224x224 pixels, and returns a 4D array (referred to as a ‘tensor’) suitable for supplying to a Keras CNN.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.preprocessing</span> <span class="kn">import</span> <span class="n">image</span>
<span class="kn">from</span> <span class="nn">keras.applications.resnet50</span> <span class="kn">import</span> <span class="n">ResNet50</span><span class="p">,</span> <span class="n">preprocess_input</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">path_to_tensor</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
    <span class="c1"># loads RGB image as PIL.Image.Image type
</span>    <span class="n">img</span> <span class="o">=</span> <span class="n">image</span><span class="p">.</span><span class="n">load_img</span><span class="p">(</span><span class="n">img_path</span><span class="p">,</span> <span class="n">target_size</span><span class="o">=</span><span class="p">(</span><span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">))</span>
    <span class="c1"># convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
</span>    <span class="n">x</span> <span class="o">=</span> <span class="n">image</span><span class="p">.</span><span class="n">img_to_array</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
    <span class="c1"># convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor
</span>    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">ResNet50_predict_labels</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
    <span class="c1"># returns prediction vector for image located at img_path
</span>    <span class="n">img</span> <span class="o">=</span> <span class="n">preprocess_input</span><span class="p">(</span><span class="n">path_to_tensor</span><span class="p">(</span><span class="n">img_path</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">ResNet50_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">img</span><span class="p">))</span>
</code></pre></div></div>

<p>Also, all of these pre-trained models require an additional normalization step: the mean pixel (computed over the ImageNet training set) must be subtracted from every pixel in each image. This is implemented in the imported function <code class="language-plaintext highlighter-rouge">preprocess_input</code>.</p>

<p>As shown in the code above, for the final prediction we obtain an integer corresponding to the model’s predicted object class by taking the <a href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.argmax.html">argmax</a> of the predicted probability vector, which we can identify with an object category through the use of the ImageNet labels <a href="https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a">dictionary</a>.</p>
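<p>In the ImageNet label ordering used here, dog breeds occupy the contiguous block of categories 151 (‘Chihuahua’) through 268 (‘Mexican hairless’). A <code class="language-plaintext highlighter-rouge">dog_detector</code> helper (used by <code class="language-plaintext highlighter-rouge">run_app</code> in section 4) can therefore simply check whether the predicted index falls in that range. A minimal sketch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def dog_detector(img_path):
    # ImageNet categories 151-268 correspond to dog breeds
    prediction = ResNet50_predict_labels(img_path)
    return 151 &lt;= prediction &lt;= 268
</code></pre></div></div>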

<h1 id="3-build-your-cnn-classifier-using-transfer-learning">3. Build your CNN Classifier using Transfer Learning</h1>

<p>Now that we have functions for detecting humans and dogs in images, we need a way to predict breed from images. In this section, we will create a CNN that classifies dog breeds.</p>

<p>To reduce training time without sacrificing accuracy, we’ll be training a CNN using <a href="https://towardsdatascience.com/transfer-learning-leveraging-insights-from-large-data-sets-d5435071ec5a">Transfer Learning</a>, a method that allows us to reuse networks that have been pre-trained on a large dataset. By keeping the early layers and only training the newly added layers, we can tap into the knowledge gained by the pre-trained network and apply it to our application; an equivalent way to set this up is sketched below.</p>
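<p>As an illustration, here is a minimal Keras sketch of that idea (an alternative to pre-computing bottleneck features, which is the approach we take below): freeze the pre-trained layers and train only a newly added head. The variable names are for illustration only.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from keras.applications.resnet50 import ResNet50
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

# load ResNet-50 without its ImageNet classification head
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# freeze the pre-trained layers so only the new head is trained
for layer in base_model.layers:
    layer.trainable = False

# stack a new classification head on top (133 dog breeds)
x = GlobalAveragePooling2D()(base_model.output)
predictions = Dense(133, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
</code></pre></div></div>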

<p><a href="https://keras.io/applications/">Keras</a> includes several pre-trained deep learning models that can be used for prediction, feature extraction, and fine-tuning.</p>

<h2 id="31-model-architecture">3.1 Model Architecture</h2>

<p>As previously mentioned, the ResNet-50 model’s last convolutional output is going to be the input to our model: the so-called bottleneck features. The code block below sets up the headless ResNet-50 used to extract the bottleneck features for the train, validation, and test sets.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.applications.resnet50</span> <span class="kn">import</span> <span class="n">ResNet50</span>

<span class="c1"># define ResNet50 model
</span><span class="n">ResNet50_model</span> <span class="o">=</span> <span class="n">ResNet50</span><span class="p">(</span><span class="n">weights</span><span class="o">=</span><span class="s">'imagenet'</span><span class="p">,</span> <span class="n">include_top</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">extract_Resnet50</span><span class="p">(</span><span class="n">tensor</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">ResNet50_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">preprocess_input</span><span class="p">(</span><span class="n">tensor</span><span class="p">))</span>
</code></pre></div></div>

<p>We’ll set up our model architecture such that the last convolutional output of ResNet-50 is fed as input to our model. We only add a <a href="https://keras.io/layers/pooling/">Global Average Pooling</a> layer and a <a href="https://keras.io/layers/core/">Fully Connected</a> layer, where the latter contains one node for each dog category and has a <a href="https://keras.io/activations/#softmax">Softmax</a> activation function.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">GlobalAveragePooling2D</span><span class="p">,</span> <span class="n">Dense</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>

<span class="n">Resnet50_model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">Resnet50_model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">GlobalAveragePooling2D</span><span class="p">(</span><span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">2048</span><span class="p">)))</span>
<span class="n">Resnet50_model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">133</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">))</span>

<span class="n">Resnet50_model</span><span class="p">.</span><span class="n">summary</span><span class="p">()</span>
</code></pre></div></div>

<p>As we can see in the above code’s output, we end up with a Neural Network with 272,517 parameters!</p>

<h2 id="32-compile--test-the-model">3.2 Compile &amp; Test the Model</h2>

<p>Now, we can train the CNN and then test how well it identifies breeds within our test dataset of dog images. During training, we go through 20 passes over the training data (or ‘<a href="https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9">epochs</a>’) in which the model’s weights are adjusted to reduce the loss function (<a href="https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/">categorical cross-entropy</a>), which is minimised using the RMSprop optimizer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Resnet50_model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="s">'rmsprop'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>

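<span class="c1"># train_Resnet50, valid_Resnet50, test_Resnet50 hold the pre-computed bottleneck features;
</span><span class="c1"># train_targets, valid_targets, test_targets are the one-hot encoded breed labels
</span>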
<span class="n">Resnet50_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_Resnet50</span><span class="p">,</span> <span class="n">train_targets</span><span class="p">,</span> 
          <span class="n">validation_data</span><span class="o">=</span><span class="p">(</span><span class="n">valid_Resnet50</span><span class="p">,</span> <span class="n">valid_targets</span><span class="p">),</span>
          <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># get index of predicted dog breed for each image in test set
</span><span class="n">Resnet50_predictions</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">Resnet50_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)))</span> <span class="k">for</span> <span class="n">tensor</span> <span class="ow">in</span> <span class="n">test_Resnet50</span><span class="p">]</span>

<span class="c1"># report test accuracy
</span><span class="n">test_accuracy</span> <span class="o">=</span> <span class="mi">100</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">Resnet50_predictions</span><span class="p">)</span><span class="o">==</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">test_targets</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">Resnet50_predictions</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test accuracy: %.4f%%'</span> <span class="o">%</span> <span class="n">test_accuracy</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Test accuracy: 80.0239%</strong></p>

<p>Provided with a testing set, the algorithm scored a testing accuracy of 80%. Not bad at all!</p>

<h2 id="33-predict-dog-breed-with-the-model">3.3 Predict Dog Breed with the Model</h2>

<p>Now that we have the trained model, let’s write a function that takes an image path as input and returns the dog breed predicted by our model (<code class="language-plaintext highlighter-rouge">dog_names</code> below is the list of breed names that accompanies the dataset, mapping class indices to breed names).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">Resnet50_predict_breed</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
    <span class="c1"># extract bottleneck features
</span>    <span class="n">bottleneck_feature</span> <span class="o">=</span> <span class="n">extract_Resnet50</span><span class="p">(</span><span class="n">path_to_tensor</span><span class="p">(</span><span class="n">img_path</span><span class="p">))</span>
    <span class="c1"># obtain predicted vector
</span>    <span class="n">predicted_vector</span> <span class="o">=</span> <span class="n">Resnet50_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">bottleneck_feature</span><span class="p">)</span>
    <span class="c1"># return dog breed that is predicted by the model
</span>    <span class="k">return</span> <span class="n">dog_names</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">predicted_vector</span><span class="p">)]</span>
</code></pre></div></div>

<h1 id="4-testing-our-cnn-classifier">4. Testing our CNN Classifier</h1>

<p>Now, we can write a function that accepts a file path to an image and first determines whether the image contains a human, a dog, or neither.</p>

<p>If a dog is detected in the image, return the predicted breed. If a human is detected in the image, return the resembling dog breed. If neither is detected in the image, provide output that indicates an error.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_app</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
    <span class="c1">## handle cases for a human face, dog, and neither
</span>    <span class="k">if</span> <span class="n">dog_detector</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">Resnet50_predict_breed</span><span class="p">(</span><span class="n">img_path</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">face_detector</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">Resnet50_predict_breed</span><span class="p">(</span><span class="n">img_path</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"Error: Neither human nor dog detected in image."</span>
</code></pre></div></div>

<p>We are ready to take the algorithm for a spin! Let’s test the algorithm on a few sample images:</p>

<p><img src="https://unsplash.com/photos/av3cmVU_bmM/download?force=true&amp;w=800" alt="Dog classification results" />
<em>Testing our deep learning classifier on real images</em></p>

<p>These predictions look accurate to me!</p>

<p>On a final note, I noticed that the algorithm is prone to errors unless it is given a clear, front-facing shot with minimal noise in the image. Hence, we need to make the algorithm more robust to noise. One method we can use to improve our classifier is <a href="https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced">image augmentation</a>, which allows you to “augment” your data by feeding the network variations of the images supplied in the training set, as sketched below.</p>
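<p>Here is a minimal augmentation sketch using Keras’ ImageDataGenerator. Note the assumptions: <code class="language-plaintext highlighter-rouge">train_tensors</code> and <code class="language-plaintext highlighter-rouge">train_targets</code> are hypothetical arrays holding the raw image tensors and labels, and <code class="language-plaintext highlighter-rouge">model</code> stands for an end-to-end CNN (augmentation operates on raw images, so it applies when training end-to-end rather than on pre-computed bottleneck features).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from keras.preprocessing.image import ImageDataGenerator

# randomly shift, rotate, zoom, and flip the training images each epoch
datagen = ImageDataGenerator(
    rotation_range=20,       # rotations of up to 20 degrees
    width_shift_range=0.2,   # horizontal shifts (fraction of width)
    height_shift_range=0.2,  # vertical shifts (fraction of height)
    zoom_range=0.2,          # random zoom in/out
    horizontal_flip=True)    # random left-right flips

# train on augmented batches instead of the raw arrays
model.fit_generator(datagen.flow(train_tensors, train_targets, batch_size=20),
                    steps_per_epoch=len(train_tensors) // 20,
                    epochs=20)
</code></pre></div></div>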

<hr />

<p><em>If you found this article helpful, feel free to connect with me on <a href="https://linkedin.com/in/hamzabendemra" target="_blank" rel="noopener">LinkedIn</a> or check out my other articles on <a href="https://medium.com/@hamzabendemra" target="_blank" rel="noopener">Medium</a>.</em></p>]]></content><author><name></name></author><category term="data-science" /><category term="deep-learning" /><category term="tensorflow" /><category term="keras" /><category term="cnn" /><category term="neural-networks" /><category term="machine-learning" /><category term="computer-vision" /><category term="transfer-learning" /><summary type="html"><![CDATA[A comprehensive tutorial on building your first deep learning classifier using TensorFlow and Keras, featuring Convolutional Neural Networks, transfer learning, and practical dog breed classification with 80% accuracy.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://bendemra.ai/assets/images/posts/deep-learning-classifier.jpg" /><media:content medium="image" url="https://bendemra.ai/assets/images/posts/deep-learning-classifier.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Setting up Databases with PostgreSQL, PSequel, and Python</title><link href="https://bendemra.ai/data-science/data-engineering/2018/04/23/setting-up-databases-with-postgresql-psequel-and-python.html" rel="alternate" type="text/html" title="Setting up Databases with PostgreSQL, PSequel, and Python" /><published>2018-04-23T06:00:00+00:00</published><updated>2018-04-23T06:00:00+00:00</updated><id>https://bendemra.ai/data-science/data-engineering/2018/04/23/setting-up-databases-with-postgresql-psequel-and-python</id><content type="html" xml:base="https://bendemra.ai/data-science/data-engineering/2018/04/23/setting-up-databases-with-postgresql-psequel-and-python.html"><![CDATA[<blockquote>
  <p><em>Originally published on <a href="http://medium.com/data-science/leveraging-python-with-large-databases-pandas-postgresql-5073825167e0">Medium</a> on April 23, 2018</em></p>
</blockquote>

<p><img src="https://unsplash.com/photos/JKUTrJ4vK00/download?force=true&amp;w=800" alt="Database streaming and processing" />
<em>Working with large databases requires proper tools and techniques</em></p>

<p>As the demand for Data Scientists <a href="https://edgylabs.com/data-scientist-is-americas-best-career-for-the-third-year-running">continues to increase</a>, with the role dubbed the “sexiest job of the 21st century” by various outlets (including <a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century">Harvard Business Review</a>), questions have been asked about which skills aspiring data scientists should master on their way to their first data analyst job.</p>

<p>There is now a plethora of online courses to gain the skills a data scientist needs to be good at their job (excellent reviews of online resources <a href="https://medium.com/personal-growth/16-top-rated-data-science-courses-e5e161b937c3">here</a> and <a href="https://medium.freecodecamp.org/a-path-for-you-to-learn-analytics-and-data-skills-bd48ccde7325">here</a>). However, as I reviewed the various courses myself, I noticed that a lot of focus is put on exciting and flashy topics like Machine Learning and Deep Learning without covering the basics of gathering and storing the datasets needed for such analysis.</p>

<h2 id="big-data-science">(Big) Data Science</h2>

<p>Before we go into PostgreSQL, I suspect many of you have the same question: why should I care about SQL?</p>

<p>Although database management may seem like a boring topic for aspiring data scientists — implementing a dog breed classifier is very rewarding, I know! — it is a necessary skill once you join the industry, and the data supports this: SQL remains the most common and <a href="https://www.kdnuggets.com/2016/02/data-science-skills-2016.html">in-demand skill</a> listed in LinkedIn job postings for data science jobs.</p>

<p><img src="https://unsplash.com/photos/wX2L8L-fGeA/download?force=true&amp;w=800" alt="SQL and Big Data" />
<em>SQL is a necessary skill in many data science applications with large datasets</em></p>

<p><a href="https://pandas.pydata.org/">Pandas</a> can perform the most common SQL operations well but is not suitable for large databases — its main limit comes to the amount of data one can fit in memory. Hence, if a data scientist is working with large databases, SQL is used to transform the data into something manageable for pandas before loading it in memory.</p>

<p>Furthermore, SQL is much more than just a method for dropping flat files into a table. The power of SQL lies in the way it allows users to have a set of tables that “relate” to one another; these relationships are often represented in an “Entity Relationship Diagram”, or ERD.</p>

<p>Many data scientists use both simultaneously — they use SQL queries to join, slice and load data into memory; then they do the bulk of the data analysis in Python using <a href="https://pandas.pydata.org/">pandas</a> library functions.</p>

<p><img src="https://unsplash.com/photos/xbEVM6oJ1Fs/download?force=true&amp;w=800" alt="Entity Relationship Diagram" />
<em>Example of an ERD showing relationships between database tables</em></p>

<p>This is particularly important when dealing with the large datasets found in Big Data applications. Such applications would have tens of TBs in databases with several billion rows.</p>

<p>Data Scientists often start with SQL queries that extract the small slice of the data they actually need (say, 1%) into a CSV file, before moving to pandas in Python for the data analysis.</p>

<h2 id="enter-postgresql">Enter PostgreSQL</h2>

<p>There is a way to learn SQL without leaving the much-loved Python environment in which so many Machine Learning and Deep Learning techniques are taught and used: PostgreSQL.</p>

<p><a href="https://www.postgresql.org/">PostgreSQL</a> allows you to leverage the amazing <a href="https://pandas.pydata.org/">Pandas library</a> for data wrangling when dealing with large datasets that are not stored in flat files but rather in databases.</p>

<p>PostgreSQL can be installed in Windows, Mac, and Linux environments (see <a href="https://www.tutorialspoint.com/postgresql/postgresql_environment.htm">install details here</a>). If you have a Mac, I’d highly recommend installing the <a href="http://postgresapp.com/">Postgres.App</a> SQL environment. For Windows, check out <a href="https://www.openscg.com/bigsql/postgresql/installers.jsp/">BigSQL</a>.</p>

<p>PostgreSQL uses a client/server model. This involves two running processes:</p>

<p>• <strong>Server process</strong>: manages database files, accepts connections to the database from client applications, and performs database actions on behalf of the clients.
• <strong>User client app</strong>: typically provides SQL command entry, a friendly graphical interface, and database maintenance tools.</p>

<p>In real-case scenarios, the client and the server will often be on different hosts and they would communicate over a TCP/IP network connection.</p>

<h2 id="installing-postgresapp-and-psequel">Installing Postgres.App and PSequel</h2>

<p>I’ll be focusing on Postgres.App for Mac OS in the rest of the tutorial. After installing <a href="https://postgresapp.com/">PostgresApp</a>, you can setup your first database by following the instructions below:</p>

<p>• Click “Initialize” to create a new server
• An optional step is to configure a <code class="language-plaintext highlighter-rouge">$PATH</code> to be able to use the command line tools delivered with Postgres.app by executing the following command in Terminal and then close &amp; reopen the window:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo mkdir</span> <span class="nt">-p</span> /etc/paths.d <span class="o">&amp;&amp;</span> <span class="nb">echo</span> /Applications/Postgres.app/Contents/Versions/latest/bin | <span class="nb">sudo tee</span> /etc/paths.d/postgresapp
</code></pre></div></div>

<p>You now have a PostgreSQL server running on your Mac with default settings:
<code class="language-plaintext highlighter-rouge">Host: localhost, Port: 5432, Connection URL: postgresql://localhost</code></p>

<p>The PostgreSQL GUI client we’ll use in this tutorial is <a href="http://www.psequel.com/">PSequel</a>. It has a minimalist, easy-to-use interface that I really enjoy for performing everyday PostgreSQL tasks.</p>

<p><img src="https://unsplash.com/photos/mcSDtbWXUZU/download?force=true&amp;w=800" alt="PSequel Interface" />
<em>Graphical SQL Client of choice: PSequel provides a clean, intuitive interface</em></p>

<h2 id="creating-your-first-database">Creating Your First Database</h2>

<p>Once Postgres.App and PSequel are installed, you are ready to set up your first database! First, open Postgres.App and you’ll see a little elephant icon appear in the top menu.</p>

<p>You’ll also notice a button that allows you to “Open psql”. This will open a command line that allows you to enter commands. It is mostly used to create databases; we’ll create ours with the following command:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">DATABASE</span> <span class="n">sample_db</span><span class="p">;</span>
</code></pre></div></div>

<p>Then, we connect to the database we just created using PSequel. We’ll open PSequel and enter the database name, in our case: <code class="language-plaintext highlighter-rouge">sample_db</code>. Click on “Connect” to connect to the database.</p>

<h2 id="creating-and-populating-your-first-table">Creating and Populating Your First Table</h2>

<p>Let’s create a table (consisting of rows and columns) in PSequel. We define the table’s name, and the name and type of each column.</p>

<p>The datatypes available in PostgreSQL for the columns (i.e. variables) can be found in the PostgreSQL <a href="https://www.postgresql.org/docs/10/static/datatype.html">Datatypes Documentation</a>.</p>

<p>In this tutorial, we’ll create a simple table of world countries. The first column gives each country an ‘id’ integer, and the second column holds the country’s name as a variable-length character string (up to 255 characters).</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">country_list</span> <span class="p">(</span>
    <span class="n">id</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">255</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Once ready, click “Run Query”. The table will then be created in the database. Don’t forget to click the “Refresh” icon (bottom right) to see the table listed.</p>

<p>We are now ready to populate our columns with data. There are many different ways to populate a table in a database. To enter data manually, the <code class="language-plaintext highlighter-rouge">INSERT</code> statement will come in handy. For instance, to enter the country Morocco with id number 1, and Australia with id number 2, the SQL commands are:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">country_list</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'Morocco'</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">country_list</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'Australia'</span><span class="p">);</span>
</code></pre></div></div>

<p>In practice, populating the tables in the database manually is not feasible. It is likely that the data of interest is stored in CSV files. To import a CSV file into the <code class="language-plaintext highlighter-rouge">country_list_csv</code> table, you use the <code class="language-plaintext highlighter-rouge">COPY</code> statement as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">COPY</span> <span class="n">country_list_csv</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="n">name</span><span class="p">)</span>
<span class="k">FROM</span> <span class="s1">'C:</span><span class="se">\{</span><span class="s1">path}</span><span class="se">\{</span><span class="s1">file_name}.csv'</span> 
<span class="k">DELIMITER</span> <span class="s1">','</span> <span class="n">CSV</span> <span class="n">HEADER</span><span class="p">;</span>
</code></pre></div></div>

<p>As you can see in the commands above, the table with column names is specified after the <code class="language-plaintext highlighter-rouge">COPY</code> command. The columns must be ordered in the same fashion as in the CSV file. The CSV file path is specified after the <code class="language-plaintext highlighter-rouge">FROM</code> keyword. The CSV <code class="language-plaintext highlighter-rouge">DELIMITER</code> must also be specified.</p>

<p>If the CSV file contains a header line with column names, it is indicated with the <code class="language-plaintext highlighter-rouge">HEADER</code> keyword so that PostgreSQL ignores the first line when importing the data from the CSV file.</p>

<h2 id="common-sql-commands">Common SQL Commands</h2>

<p>The key to SQL is understanding statements. A few statements include:</p>

<ol>
  <li><strong>CREATE TABLE</strong> is a statement that creates a new table in a database.</li>
  <li><strong>DROP TABLE</strong> is a statement that removes a table in a database.</li>
  <li><strong>SELECT</strong> allows you to read data and display it.</li>
</ol>

<p>SELECT is where you tell the query which columns you want back. FROM is where you tell the query which table you are querying; note that the columns must exist in this table. For example, say we have a table of orders with several columns, but we are only interested in a subset of three:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">account_id</span><span class="p">,</span> <span class="n">occurred_at</span>
<span class="k">FROM</span> <span class="n">orders</span><span class="p">;</span>
</code></pre></div></div>

<p>Also, the <strong>LIMIT</strong> statement is useful when you want to see just the first few rows of a table. This can be much faster to load than the entire dataset. The <strong>ORDER BY</strong> statement allows us to order a table by any column. We can apply these two statements together to a table of ‘orders’ in a database as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">account_id</span><span class="p">,</span> <span class="n">total_amt_usd</span>
<span class="k">FROM</span> <span class="n">orders</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">total_amt_usd</span> <span class="k">DESC</span>
<span class="k">LIMIT</span> <span class="mi">5</span><span class="p">;</span>
</code></pre></div></div>

<h2 id="explore-other-sql-commands">Explore Other SQL Commands</h2>

<p>Now that you know how to set up a database and create and populate a table in PostgreSQL, you can explore the other common SQL commands explained in the following tutorials:</p>

<p>• <a href="https://towardsdatascience.com/how-to-ace-data-science-interviews-sql-b71de212e433">SQL Basics for Data Science</a>
• <a href="https://www.khanacademy.org/computing/computer-programming/sql/sql-basics/p/creating-a-table-and-inserting-data">Khan Academy: Intro to SQL</a>
• <a href="https://towardsdatascience.com/sql-in-a-nutshell-part-1-basic-real-world-scenarios-33a25ba8d220">SQL in a Nutshell</a>
• <a href="https://www.codecademy.com/articles/sql-commands">Cheat Sheet SQL Commands</a></p>

<h2 id="accessing-postgresql-database-in-python">Accessing PostgreSQL Database in Python</h2>

<p>Once your PostgreSQL database and tables are set up, you can then move to Python to perform any Data Analysis or Wrangling required.</p>

<p>PostgreSQL can be integrated with Python using the <a href="http://initd.org/psycopg/docs/">psycopg2 module</a>, a popular PostgreSQL database adapter for Python. It is not part of Python’s standard library, but it can be installed with a simple <code class="language-plaintext highlighter-rouge">pip install psycopg2</code>.</p>

<p>Connecting to an existing PostgreSQL database can be achieved with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">psycopg2</span>

<span class="n">conn</span> <span class="o">=</span> <span class="n">psycopg2</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span>
    <span class="n">database</span><span class="o">=</span><span class="s">"sample_db"</span><span class="p">,</span> 
    <span class="n">user</span><span class="o">=</span><span class="s">"postgres"</span><span class="p">,</span> 
    <span class="n">password</span><span class="o">=</span><span class="s">"pass123"</span><span class="p">,</span> 
    <span class="n">host</span><span class="o">=</span><span class="s">"127.0.0.1"</span><span class="p">,</span> 
    <span class="n">port</span><span class="o">=</span><span class="s">"5432"</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Going back to our country_list table example, inserting records into the table in sample_db can be accomplished in Python with the following commands:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cur</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="n">cur</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"INSERT INTO country_list (id, name) VALUES (1, 'Morocco')"</span><span class="p">)</span>
<span class="n">cur</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"INSERT INTO country_list (id, name) VALUES (2, 'Australia')"</span><span class="p">)</span>
<span class="n">conn</span><span class="p">.</span><span class="n">commit</span><span class="p">()</span>
<span class="n">conn</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
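<p>To tie this back to the pandas workflow discussed earlier, query results can be loaded straight into a DataFrame with pandas’ <code class="language-plaintext highlighter-rouge">read_sql</code>. A minimal sketch, re-opening the connection since the example above closed it:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
import psycopg2

# re-open the connection closed in the previous example
conn = psycopg2.connect(database="sample_db", user="postgres",
                        password="pass123", host="127.0.0.1", port="5432")

# load the query result directly into a pandas DataFrame for analysis
df = pd.read_sql("SELECT id, name FROM country_list ORDER BY id;", conn)
print(df.head())

conn.close()
</code></pre></div></div>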

<p>Other commands for creating, populating and querying tables can be found on various tutorials on <a href="https://www.tutorialspoint.com/postgresql/postgresql_python.htm">Tutorial Points</a> and <a href="http://www.postgresqltutorial.com/postgresql-python/">PostgreSQL Tutorial</a>.</p>

<h2 id="conclusion">Conclusion</h2>

<p>You now have a working PostgreSQL database server ready for you to populate and play with. It’s powerful, flexible, free and is used by numerous applications.</p>

<p>The combination of PostgreSQL’s robustness and Python’s analytical capabilities provides data scientists with a powerful toolkit for handling large datasets that don’t fit comfortably in memory. By mastering these fundamental database skills alongside your machine learning knowledge, you’ll be well-equipped to handle real-world data science challenges.</p>

<hr />

<p><em>If you found this article helpful, feel free to connect with me on <a href="https://linkedin.com/in/hamzabendemra" target="_blank" rel="noopener">LinkedIn</a> or check out my other articles on <a href="https://medium.com/@hamzabendemra" target="_blank" rel="noopener">Medium</a>.</em></p>]]></content><author><name></name></author><category term="data-science" /><category term="data-engineering" /><category term="sql" /><category term="postgresql" /><category term="database" /><category term="python" /><category term="pandas" /><category term="big-data" /><category term="psycopg2" /><summary type="html"><![CDATA[A comprehensive guide to setting up and working with PostgreSQL databases for data science, covering installation, basic SQL operations, and Python integration using psycopg2 for handling large datasets.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://bendemra.ai/assets/images/posts/postgresql-python.jpg" /><media:content medium="image" url="https://bendemra.ai/assets/images/posts/postgresql-python.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Using R to Visualise Data on Aviation Tragedies in the US since 1948</title><link href="https://bendemra.ai/data-science/data-analysis/2018/02/16/using-r-to-visualise-data-on-aviation-tragedies-in-the-us-since-1948.html" rel="alternate" type="text/html" title="Using R to Visualise Data on Aviation Tragedies in the US since 1948" /><published>2018-02-16T06:00:00+00:00</published><updated>2018-02-16T06:00:00+00:00</updated><id>https://bendemra.ai/data-science/data-analysis/2018/02/16/using-r-to-visualise-data-on-aviation-tragedies-in-the-us-since-1948</id><content type="html" xml:base="https://bendemra.ai/data-science/data-analysis/2018/02/16/using-r-to-visualise-data-on-aviation-tragedies-in-the-us-since-1948.html"><![CDATA[<blockquote>
  <p><em>Originally published on <a href="https://medium.com/data-science/data-visualisations-of-aviation-tragedies-in-the-us-since-1948-4f1d7371b799">Medium</a> on February 16, 2018</em></p>
</blockquote>

<p><img src="https://unsplash.com/photos/6ArTTluciuA/download?force=true&amp;w=800" alt="Aviation accident investigation" />
<em>NTSB investigators looking at aviation accident scenes help us understand safety patterns</em></p>

<p>In this post, I look at a dataset sourced from the <a href="https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx">NTSB Aviation Accident Database</a>, which contains information about civil aviation accidents. The dataset is also available on <a href="https://www.kaggle.com/khsamaha/aviation-accident-database-synopses">Kaggle</a>.</p>

<p>This <a href="https://towardsdatascience.com/tagged/exploratory-data-analysis">Exploratory Data Analysis</a> (<a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">EDA</a>) aims to perform an initial exploration of the data and get an initial look at relationships between the various variables present in the dataset.</p>

<p>My aim is also to show how a simple understanding of <a href="https://towardsdatascience.com/tagged/data-analysis">Data Analysis</a> and <a href="https://towardsdatascience.com/tagged/data-wrangling">Wrangling</a> in <a href="https://towardsdatascience.com/tagged/R">R</a>, coupled with domain knowledge, can provide a better understanding of relationships between variables in a dataset. The R Markdown file can be found in this <a href="https://github.com/hamzaben86/Exploratory-Data-Analysis-Projects/tree/master/NSTB-Aviation-Accidents-EDA-R">GitHub repo</a>.</p>

<h1 id="introduction">Introduction</h1>

<p>First, a quick intro to the dataset I’ll be exploring. The dataset features 81,013 observations of 31 variables related to recorded aviation accidents.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Dataset dimensions</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">aviation_data</span><span class="p">)</span><span class="w">
</span><span class="c1">## [1] 81013    31</span><span class="w">
</span></code></pre></div></div>

<p>Variables provide information on a variety of topics including the date and location of observations, the model and type of aircraft, the injuries sustained by passengers and the damage sustained by the aircraft, and the reported weather conditions at the time.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Variable names in the dataset</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">aviation_data</span><span class="p">)</span><span class="w">
</span><span class="c1">## [1] "Event.Id"               "Investigation.Type"    </span><span class="w">
</span><span class="c1">## [3] "Accident.Number"        "Event.Date"            </span><span class="w">
</span><span class="c1">## [5] "Location"               "Country"               </span><span class="w">
</span><span class="c1">## [7] "Latitude"               "Longitude"             </span><span class="w">
</span><span class="c1">## [9] "Airport.Code"           "Airport.Name"          </span><span class="w">
</span><span class="c1">## [11] "Injury.Severity"        "Aircraft.Damage"       </span><span class="w">
</span><span class="c1">## [13] "Aircraft.Category"      "Registration.Number"   </span><span class="w">
</span><span class="c1">## [15] "Make"                   "Model"                 </span><span class="w">
</span><span class="c1">## [17] "Amateur.Built"          "Number.of.Engines"     </span><span class="w">
</span><span class="c1">## [19] "Engine.Type"            "FAR.Description"       </span><span class="w">
</span><span class="c1">## [21] "Schedule"               "Purpose.of.Flight"     </span><span class="w">
</span><span class="c1">## [23] "Air.Carrier"            "Total.Fatal.Injuries"  </span><span class="w">
</span><span class="c1">## [25] "Total.Serious.Injuries" "Total.Minor.Injuries"  </span><span class="w">
</span><span class="c1">## [27] "Total.Uninjured"        "Weather.Condition"     </span><span class="w">
</span><span class="c1">## [29] "Broad.Phase.of.Flight"  "Report.Status"         </span><span class="w">
</span><span class="c1">## [31] "Publication.Date"</span><span class="w">
</span></code></pre></div></div>

<h1 id="data-wrangling">Data Wrangling</h1>

<p>Since this is an NTSB database from the United States, the majority of accidents (over 94%) in this database were recorded in the US. Hence, I will focus on the accidents that took place in the US in this analysis. After removing international observations, the new dataframe features 76,188 observations.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Filter for US accidents only</span><span class="w">
</span><span class="n">us_aviation_data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">aviation_data</span><span class="p">[</span><span class="n">aviation_data</span><span class="o">$</span><span class="n">Country</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"United States"</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">us_aviation_data</span><span class="p">)</span><span class="w">
</span><span class="c1">## [1] 76188    31</span><span class="w">
</span></code></pre></div></div>

<p>Some data wrangling was necessary, of course (details of which can be found in the R Markdown file in the <a href="https://github.com/hamzaben86/NSTB-Aviation-Accidents-EDA-R">GitHub repo</a>). For instance, the listed location names (city, state) were separated into two variables, one for the city and one for the state, for each observation.</p>

<p>The event date variable was broken down into day, month, and year components to investigate whether there are any correlations between the number of accidents and particular periods within a year.</p>

<p>Also, a better way of displaying the data on total fatalities was to group the number of fatalities into buckets (or bands). This gives a better representation of the distribution of fatalities across the observations in the dataset.</p>

<h1 id="univariate-plots-section">Univariate Plots Section</h1>

<p>In this section, I will create univariate plots for variables of interest.</p>

<h2 id="accidents-by-year-month-and-weekday">Accidents by Year, Month, and Weekday</h2>

<p>Let’s plot frequency histograms for the year, month, and weekday of the accidents in the dataset. The majority of the observations in the dataframe date from the early 1980s onwards, so let’s generate a plot from 1980 to 2017.</p>

<p><img src="https://unsplash.com/photos/JKUTrJ4vK00/download?force=true&amp;w=800" alt="Accidents by year" />
<em>Aviation accidents by year show an overall declining trend</em></p>

<p>The number of accidents has decreased by approximately 53% between 1982 and 2017, from approx. 3,400 observations to approx. 1,600 observations.</p>

<p>Next, let’s look at observations distribution by months of the year.</p>

<p>The highest number of accidents in the dataset for a given year takes place during northern-hemisphere summer time (Jun-Jul-Aug). This is likely to be correlated with the increased number of flights during the summer holiday period.</p>

<p>And finally, let’s look at the observations distribution by day of the week.</p>

<p>The highest frequency of accidents in a given week takes place during the weekend (Sat-Sun). Again, this is likely to be correlated with the increased number of flights taking place over weekends.</p>

<h2 id="total-fatal-injuries">Total Fatal Injuries</h2>

<p>The next variable of interest relates to the Total Fatal Injuries for each observation in the dataset. This is quantified by the number of people fatally injured for each recorded observation. Let’s group the number of fatalities in buckets as shown in the plot below. Note the use of the Log10 scale for the y-axis in the plot below.</p>

<p>The bulk of the recorded accidents involve fewer than 10 fatalities, while some observations show more than 100.</p>

<h2 id="engine-types">Engine Types</h2>

<p>Next, I look at the <a href="https://en.wikipedia.org/wiki/Aircraft_engine">aircraft engine types</a> recorded in the dataset. I’ve abbreviated engine type names to improve labelling of the x-axis. Note the use of the Log10 scale for the y-axis in the plot below.</p>

<p>According to the plots above, the bulk of the engines in the reported accidents are <a href="https://en.wikipedia.org/wiki/Reciprocating_engine">reciprocating engines</a>, which were prevalent in commercial aircraft, particularly aircraft built during the 20th century. Recent aircraft, like the <a href="https://en.wikipedia.org/wiki/Airbus_A380#Engines">Airbus A380</a> or the Boeing <a href="https://en.wikipedia.org/wiki/Boeing_787_Dreamliner#Engines">787 Dreamliner</a>, rely on <a href="https://en.wikipedia.org/wiki/Jet_engine#Turbine_powered">turbine engines</a> (e.g. TurboFan, TurboProp).</p>

<h2 id="weather-conditions">Weather Conditions</h2>

<p>Next, I look at the weather conditions recorded in the dataset. Two key aviation weather conditions to be familiar with here: <a href="https://en.wikipedia.org/wiki/Visual_meteorological_conditions">VMC</a>, which means conditions are such that pilots have sufficient visibility to fly the aircraft while maintaining visual separation from terrain and other aircraft, and <a href="https://en.wikipedia.org/wiki/Instrument_meteorological_conditions">IMC</a>, which means weather conditions require pilots to fly primarily by reference to instruments.</p>

<p>The bulk of accidents in the dataset take place during VMC weather conditions, which are great conditions for flying as VMC requires greater visibility and cloud clearance than IMC.</p>

<p>I imagine this would go against most people’s intuition when it comes to the relationship between weather conditions and aviation accidents. Pilots are indeed well trained to fly in all sorts of weather conditions, relying solely on the <a href="https://en.wikipedia.org/wiki/Avionics">avionics instruments</a> at their disposal.</p>

<h2 id="broad-phases-of-flight">Broad Phases of Flight</h2>

<p>Next, let’s look at the phases of flight for the recorded accidents in the dataset. According to the plot, the bulk of accidents took place during landing or take-off. It is well known in the industry that these are high-risk phases, often referred to as “<a href="http://aviationknowledge.wikidot.com/sop:critical-phases-of-flights">critical phases of flight</a>”.</p>

<p><img src="https://unsplash.com/photos/WKOKyXrKnJc/download?force=true&amp;w=800" alt="Flight phases analysis" />
<em>Take-off and landing are the most critical phases of flight</em></p>

<h1 id="bivariate-plots-section">Bivariate Plots Section</h1>

<p>Let’s look at the relationship between pairs of variables that could show interesting relationships.</p>

<h2 id="engine-types-and-total-fatal-injuries">Engine Types and Total Fatal Injuries</h2>

<p>The bulk of the distribution has total fatal injuries under 10, so let’s zoom in on that portion of the data. The R function <a href="http://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter</a> was used to reduce overplotting and make individual data points easier to see.</p>

<p>According to the plot, the bulk of the data for fatalities under 10 is associated with the <a href="https://en.wikipedia.org/wiki/Reciprocating_engine">reciprocating engine</a> type. The first plot shows that the Turbo-Fan engine has more outliers with high numbers of fatalities than other engine types. This is likely correlated with the use of Turbo-Fan engines on large commercial aircraft.</p>

<h2 id="weather-conditions-and-total-fatal-injuries">Weather Conditions and Total Fatal Injuries</h2>

<p>As previously noted, weather conditions do not show a particularly strong relationship with total fatal injuries. The bulk of the distribution is associated with VMC weather conditions. However, that is likely due to the fact that the vast majority of flights are flown in VMC conditions.</p>

<h2 id="phase-of-flight-and-total-fatal-injuries">Phase of Flight and Total Fatal Injuries</h2>

<p>Let’s look at the relationship of Phase of Flight and Total Fatal Injuries.</p>

<p>The plots show that Take-Off and Approach are associated with outliers with high number of fatalities. As previously noted, these two phases of flight are often referred to as “<a href="http://aviationknowledge.wikidot.com/sop:critical-phases-of-flights">critical phases of flight</a>” for that particular reason.</p>

<h2 id="event-month-and-weekday-and-total-fatal-injuries">Event Month and Weekday and Total Fatal Injuries</h2>

<p>Let’s look at the relationship of Event Date and Total Fatal Injuries. We’ll focus on the bulk of the distribution with fatalities of less than 10. There doesn’t seem to be any specific month or weekday showing a particularly high frequency of accidents.</p>

<h2 id="broad-phase-of-flight-and-weather-conditions">Broad Phase of Flight and Weather Conditions</h2>

<p>The plots indicate that there is a higher frequency of recorded observations for certain combinations of weather and phases of flight — for example, IMC flying conditions during the “cruise” or “approach” phases of flight.</p>

<h2 id="broad-phase-of-flight-and-event-month">Broad Phase of Flight and Event Month</h2>

<p>The plots indicate a higher frequency of recorded observations during the Northern summer months for Landing and Take-off. Across all months, the heat map also shows that Take-off and Landing register the highest numbers of observations.</p>

<h2 id="longitude-and-latitude-of-recorded-accidents">Longitude and Latitude of Recorded Accidents</h2>

<p>Plotting the latitude vs longitude of the accidents essentially gives us a map of the US. The plots also indicate that the coastal states are more heavily impacted than mid-western states and most of Alaska. This can be explained by the volume of flights to/from destinations in those areas of the US. It is a sad chart, however, as it shows that the vast majority of US states suffered an aviation tragedy between 1948 and 2017.</p>

<h1 id="multivariate-plots-section">Multivariate Plots Section</h1>

<p>Now let’s look at multivariate plots.</p>

<h2 id="latitude-vs-longitude-of-observations-by-month-of-the-year">Latitude vs Longitude of observations by Month of the Year</h2>

<p>The distribution of accident months across latitude and longitude is fairly spread across the US, with a slightly higher prevalence of observations during the winter in southern states like Florida.</p>

<h2 id="longitude-and-latitude-by-weather-conditions">Longitude and Latitude by Weather Conditions</h2>

<p>Let’s now look at latitude vs longitude add layer for the weather condition.</p>

<p>Weather condition VMC seems to be quite consistent, except for patches of primarily IMC conditions in certain discrete areas of the mid-west.</p>

<h2 id="broad-phase-of-flight-and-event-month-by-weather-conditions">Broad Phase of Flight and Event Month by Weather Conditions</h2>

<p>Let’s now look at the relationship of Broad Phase of Flight vs Month by Weather Conditions. I decided to keep the observations with “unknown” weather conditions, as they make up what I perceive to be a non-negligible number of observations, particularly for the “Cruise” phase of flight.</p>

<p>When looking at the relationship between Broad Phase of Flight vs Month and Weather Condition, we can see that accidents primarily take place during VMC weather conditions. However, for certain months of the year such as December and January, IMC conditions are a non-negligible portion of the observations, particularly during Approach and Cruise phases of flight.</p>

<h2 id="total-fatal-injuries-and-broad-phases-of-flight-by-weather-condition">Total Fatal Injuries and Broad Phases of Flight by Weather Condition</h2>

<p>Next, let’s look at Total Fatal Injuries vs Broad Phases of Flight by Weather Condition. I decided to focus on total fatal injuries of less than 40 to highlight the bulk of the distribution of recorded fatal injuries.</p>

<p>IMC weather conditions are associated with accidents during “Cruise” and “Approach” phases of flight with higher frequencies than in VMC weather conditions.</p>

<h2 id="total-fatal-injuries-and-engine-type-by-year">Total Fatal Injuries and Engine Type by Year</h2>

<p>This plot is interesting as it shows how certain engines have been prevalent in different time periods. For instance, Turbo-Jet and Turbo-Fan powered aircraft show a higher number of fatalities in later years, whereas reciprocating engines show a distribution of fatalities concentrated in earlier years; this corresponds to the increasing use of jet engines in modern aircraft.</p>

<p><img src="https://unsplash.com/photos/6ArTTluciuA/download?force=true&amp;w=800" alt="Aviation technology evolution" />
<em>Evolution of aviation technology reflected in accident data</em></p>

<h1 id="concluding-remarks">Concluding Remarks</h1>

<p>I hope you enjoyed this EDA and learned a few things about aviation along the way. I hope that this post also shows the power of simple data visualisations with <a href="https://www.statmethods.net/advgraphs/ggplot2.html">ggplot2 in R</a>. This skill is particularly useful when exploring a dataset in the initial phase of the <a href="https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3">data science pipeline</a>. I certainly enjoyed this fun little study done over the weekend.</p>

<p><strong>Key Findings:</strong></p>
<ul>
  <li>Aviation accidents have decreased by ~53% from 1982 to 2017</li>
  <li>Most accidents occur during summer months and weekends (higher flight volume)</li>
  <li>Landing and take-off remain the most critical phases of flight</li>
  <li>Counterintuitively, most accidents occur in good weather conditions (VMC)</li>
  <li>Engine technology evolution is clearly visible in the accident patterns over time</li>
</ul>

<hr />

<p><em>If you found this article helpful, feel free to connect with me on <a href="https://linkedin.com/in/hamzabendemra" target="_blank" rel="noopener">LinkedIn</a> or check out my other articles on <a href="https://medium.com/@hamzabendemra" target="_blank" rel="noopener">Medium</a>.</em></p>]]></content><author><name></name></author><category term="data-science" /><category term="data-analysis" /><category term="r" /><category term="data-visualization" /><category term="exploratory-data-analysis" /><category term="ggplot2" /><category term="aviation" /><category term="ntsb" /><summary type="html"><![CDATA[An in-depth exploratory data analysis of aviation accidents in the US from 1948-2017 using R and ggplot2, examining relationships between weather conditions, phases of flight, engine types, and fatalities using NTSB data.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://bendemra.ai/assets/images/posts/aviation-analysis.jpg" /><media:content medium="image" url="https://bendemra.ai/assets/images/posts/aviation-analysis.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>