Machine Learning Made Easy With Python

February 3, 2021

Solve real-world machine learning problems with Naïve Bayes classifiers.

Naïve Bayes is a classification technique that serves as the basis for implementing several classifier modeling algorithms. Naïve Bayes-based classifiers are considered some of the simplest, fastest, and easiest-to-use machine learning techniques, yet are still effective for real-world applications.

Naïve Bayes is based on Bayes’ theorem, formulated by 18th-century statistician Thomas Bayes. This theorem assesses the probability that an event will occur based on conditions related to the event. For example, an individual with Parkinson’s disease typically has voice variations; hence such symptoms are considered related to the prediction of a Parkinson’s diagnosis. The original Bayes’ theorem provides a method to determine the probability of a target event, and the Naïve variant extends and simplifies this method.

Solving a real-world problem

This article demonstrates a Naïve Bayes classifier’s capabilities to solve a real-world problem (as opposed to a complete business-grade application). I’ll assume you have basic familiarity with machine learning (ML), so some of the steps that are not primarily related to ML prediction, such as data shuffling and splitting, are not covered here.

The Naïve Bayes classifier is supervised, generative, non-linear, parametric, and probabilistic.

In this article, I’ll demonstrate using Naïve Bayes with the example of predicting a Parkinson’s diagnosis. The dataset for this example comes from the UCI Machine Learning Repository. The data includes several speech signal measurements used to assess the likelihood of the medical condition; this example uses the first eight of them:

MDVP:Fo(Hz): Average vocal fundamental frequency
MDVP:Fhi(Hz): Maximum vocal fundamental frequency
MDVP:Flo(Hz): Minimum vocal fundamental frequency
MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, and Jitter:DDP: Five measures of variation in fundamental frequency

The dataset used in this example, shuffled and split for use, is available in my GitHub repository.
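The shuffling and splitting step itself is not shown in this article. As a rough sketch of how it could be done with scikit-learn's train_test_split, the snippet below uses a small made-up DataFrame in place of the real dataset. The values, the 80/20 ratio, and the random seed are assumptions for illustration, not the settings used to produce the files in the repository.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 20 rows, two feature columns, and a binary label
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "MDVP:Fo(Hz)": rng.normal(150, 30, size=20),
    "MDVP:Fhi(Hz)": rng.normal(190, 40, size=20),
    "status": [0, 1] * 10,
})

# Shuffle the rows and hold out 20% for testing; stratify keeps the
# proportion of each status value similar in both parts
train_data, test_data = train_test_split(
    data, test_size=0.2, shuffle=True, stratify=data["status"], random_state=42
)

print(len(train_data), len(test_data))  # 16 4
```

Each part could then be written out with DataFrame.to_csv so that training and testing read from separate files, as the script in this article does.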

ML with Python

I’ll use Python to implement the solution. The software I used for this application is:

Python 3.8.2
Pandas 1.1.1
scikit-learn 0.22.2.post1

There are several open source Naïve Bayes classifier implementations available in Python, including:

NLTK Naïve Bayes: Based on the standard Naïve Bayes algorithm for text classification
NLTK Positive Naïve Bayes: A variant of NLTK Naïve Bayes that performs binary classification with partially labeled training sets
Scikit-learn Gaussian Naïve Bayes: Provides partial fit to support a data stream or very large dataset
Scikit-learn Multinomial Naïve Bayes: Optimized for discrete features, such as counts or frequencies
Scikit-learn Bernoulli Naïve Bayes: Designed for binary/Boolean features
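To illustrate the partial fit mentioned for Gaussian Naïve Bayes, here is a minimal sketch that feeds made-up data to GaussianNB.partial_fit in chunks, as if it arrived as a stream. The data, the chunk size, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: 300 samples, 3 features; class 1 has a shifted mean
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 3)) + y[:, None] * 2.0

gnb_stream = GaussianNB()
# Feed the data in chunks of 100 rows. The full list of classes must be
# passed on the first call, because a chunk may not contain every class.
for start in range(0, 300, 100):
    chunk = slice(start, start + 100)
    gnb_stream.partial_fit(X[chunk], y[chunk], classes=np.array([0, 1]))

# For comparison, fit on all of the data at once
gnb_full = GaussianNB().fit(X, y)

# The per-class feature means learned incrementally match the one-shot fit
print(np.allclose(gnb_stream.theta_, gnb_full.theta_))
```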

I will use sklearn Gaussian Naive Bayes for this example.

Here is my Python implementation of naive_bayes_parkinsons.py:

import pandas as pd

# Feature columns we use
x_rows=['MDVP:Fo(Hz)','MDVP:Fhi(Hz)','MDVP:Flo(Hz)',
        'MDVP:Jitter(%)','MDVP:Jitter(Abs)','MDVP:RAP','MDVP:PPQ','Jitter:DDP']
# Target column: 1 = Parkinson's, 0 = healthy
y_rows=['status']

# Train

# Read train data
train_data = pd.read_csv('parkinsons/Data_Parkinsons_TRAIN.csv')
train_x = train_data[x_rows]
train_y = train_data[y_rows]
print("train_x:\n", train_x)
print("train_y:\n", train_y)

# Load sklearn Gaussian Naive Bayes and fit
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
# fit() expects a 1-D label array, so flatten the single-column DataFrame
gnb.fit(train_x, train_y.values.ravel())

# Prediction on train data
predict_train = gnb.predict(train_x)
print('Prediction on train data:', predict_train)

# Accuracy score on train data
from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(train_y, predict_train)
print('Accuracy score on train data:', accuracy_train)

# Test

# Read test data
test_data = pd.read_csv('parkinsons/Data_Parkinsons_TEST.csv')
test_x = test_data[x_rows]
test_y = test_data[y_rows]

# Prediction on test data
predict_test = gnb.predict(test_x)
print('Prediction on test data:', predict_test)

# Accuracy score on test data (report accuracy_test, not accuracy_train)
accuracy_test = accuracy_score(test_y, predict_test)
print('Accuracy score on test data:', accuracy_test)

Run the Python application:

$ python naive_bayes_parkinsons.py

train_x:
     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  ...  MDVP:RAP  MDVP:PPQ  Jitter:DDP
0        152.125       161.469  ...   0.00191   0.00226     0.00574
1        120.080       139.710  ...   0.00180   0.00220     0.00540
2        122.400       148.650  ...   0.00465   0.00696     0.01394
3        237.323       243.709  ...   0.00173   0.00159     0.00519
..           ...           ...  ...       ...       ...         ...
155      138.190       203.522  ...   0.00406   0.00398     0.01218

[156 rows x 8 columns]

train_y:
     status
0         1
1         1
2         1
3         0
..      ...
155       1

[156 rows x 1 columns]

Prediction on train data: [1 1 1 0 ... 1]
Accuracy score on train data: 0.6666666666666666

Prediction on test data: [1 1 1 1 ... 1
 1 1]
Accuracy score on test data: 0.6666666666666666

The accuracy scores on the train and test sets are 67% in this example, so there is room to optimize its performance. Do you want to give it a try? If so, share your approach in the comments below.
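One possible starting point for that exercise is to compare candidate models with cross-validation rather than a single train/test score. The sketch below does this on made-up, skewed data (not the Parkinson's dataset). The choice of a most-frequent-class baseline and of a PowerTransformer step is my assumption about what might be worth trying, not a result from this article.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Made-up, right-skewed stand-in data: 200 samples, 4 features
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.exp(rng.normal(size=(200, 4)) + y[:, None] * 0.5)

candidates = {
    "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "GaussianNB": GaussianNB(),
    # Transform each feature toward a Gaussian shape before fitting
    "PowerTransformer + GaussianNB": make_pipeline(PowerTransformer(), GaussianNB()),
}

# Mean accuracy over 5 cross-validation folds for each candidate
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

Comparing against the baseline shows whether a model has learned anything beyond always predicting the most common class.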

Under the hood

The Naïve Bayes classifier is based on Bayes’ rule, or theorem, which computes conditional probability: the likelihood that an event occurs given that another, related event has occurred. Stated in simple terms, it answers the question: If we know how often x was observed when y occurred, what is the probability of y when x is observed again? The rule starts from a prior probability and updates it with the observed evidence to arrive at a posterior probability. A fundamental assumption of Naïve Bayes is that all features contribute independently and are of equal importance.
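To make the rule concrete, here is a small worked example of the arithmetic. All the probabilities are hypothetical numbers chosen only for illustration; they are not taken from the Parkinson's dataset.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic
p_y = 0.2              # prior: P(y)
p_x_given_y = 0.9      # likelihood: P(x | y)
p_x_given_not_y = 0.3  # P(x | not y)

# Total probability of observing x, with or without y
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Bayes' rule: P(y | x) = P(x | y) * P(y) / P(x)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 3))  # 0.429
```

Observing x raises the probability of y from the prior of 0.2 to a posterior of about 0.43.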

At a high level, the steps involved in Bayes’ computation are:

1. Compute the overall prior probabilities of each outcome (“Has Parkinson’s” and “Doesn’t have Parkinson’s”)
2. Compute the conditional probabilities of the observed values, jointly across all features, for each possible outcome
3. Compute the final posterior probability by multiplying the results of #1 and #2 for the desired outcomes

Step 2 can be computationally quite arduous. Naïve Bayes simplifies it:

1. Compute the overall prior probabilities of each outcome (“Has Parkinson’s” and “Doesn’t have Parkinson’s”)
2. Compute the conditional probability of each feature value separately, treating the features as independent, for the desired outcomes
3. Compute the final posterior probability by multiplying the results of #1 and #2 for the desired outcomes
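The simplified steps can be sketched by hand for the Gaussian case and checked against scikit-learn. This is an illustration on made-up data, not the library's actual implementation: the helper log_gaussian and all variable names are my own, and the multiplication in step 3 is done as a sum of logarithms to avoid numeric underflow.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: 40 samples of class 0 and 60 of class 1, 3 features
rng = np.random.default_rng(1)
y = np.array([0] * 40 + [1] * 60)
X = rng.normal(size=(100, 3)) + y[:, None] * 1.5

classes = np.unique(y)
# Step 1: class priors from the label frequencies
priors = np.array([(y == c).mean() for c in classes])
# Step 2: per-class mean and variance of each feature, estimated independently
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_gaussian(x, mean, var):
    """Log of the Gaussian density, evaluated per feature."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# Step 3: log prior + sum of per-feature log likelihoods, for each class
x_new = X[:5]
log_joint = np.array([
    np.log(priors[k]) + log_gaussian(x_new, means[k], variances[k]).sum(axis=1)
    for k in range(len(classes))
]).T  # shape: (5 samples, 2 classes)
manual_pred = classes[log_joint.argmax(axis=1)]

# Compare with scikit-learn's GaussianNB on the same rows
sk_pred = GaussianNB().fit(X, y).predict(x_new)
print(manual_pred, sk_pred)
```

Picking the class with the largest value in step 3 is enough for prediction; dividing by the sum over classes would turn the values into posterior probabilities.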

This is a very basic explanation, and several other factors must be considered, such as data types, sparse data, missing data, and more.

Hyperparameters

Naïve Bayes, being a simple and direct algorithm, does not need hyperparameters. However, specific implementations may provide advanced features. For example, GaussianNB has two:

priors: Prior probabilities can be specified instead of the algorithm taking the priors from data.
var_smoothing: A fraction of the largest feature variance that is added to every variance for numerical stability, which helps when a feature has near-zero variance within a class.
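As a minimal sketch of how these two parameters can be set, the snippet below fits GaussianNB on made-up data with default and custom settings. The specific values (equal priors and a smoothing factor of 1e-8) are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: 30 samples of class 0 and 70 of class 1, 2 features
rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 70)
X = rng.normal(size=(100, 2)) + y[:, None]

# Default: priors are estimated from the label frequencies
default_model = GaussianNB().fit(X, y)
print(default_model.class_prior_)  # [0.3 0.7]

# Custom: fixed priors and a larger smoothing factor
custom_model = GaussianNB(priors=[0.5, 0.5], var_smoothing=1e-8).fit(X, y)
print(custom_model.class_prior_)   # [0.5 0.5]

# epsilon_ is the absolute value added to the variances:
# var_smoothing times the largest feature variance
print(custom_model.epsilon_)
```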

Loss functions

Maintaining its philosophy of simplicity, Naïve Bayes uses a 0-1 loss function. If the prediction correctly matches the expected outcome, the loss is 0, and it’s 1 otherwise.
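To make the 0-1 loss concrete, the sketch below computes it on a handful of made-up labels, both by hand and with scikit-learn's zero_one_loss. The label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, zero_one_loss

# Made-up expected outcomes and predictions
y_true = np.array([1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

# Loss per sample: 0 for a correct prediction, 1 for a wrong one
per_sample_loss = (y_true != y_pred).astype(int)
print(per_sample_loss)  # [0 1 0 0 1 0]

# Averaged over the samples, this is the fraction misclassified
average_loss = zero_one_loss(y_true, y_pred)
print(average_loss)

# The average 0-1 loss is the complement of the accuracy score
print(1 - accuracy_score(y_true, y_pred))
```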

Pros and cons

Pro: Naïve Bayes is one of the easiest and fastest algorithms.
Pro: Naïve Bayes gives reasonable predictions even with less data.
Con: Naïve Bayes predictions are estimates, not precise. It favors speed over accuracy.
Con: A fundamental Naïve Bayes assumption is the independence of all features, but this may not always be true.

In essence, Naïve Bayes is an extension of Bayes’ theorem. It is one of the simplest and fastest machine learning algorithms, intended for easy and quick training and prediction. Naïve Bayes provides good-enough, reasonably accurate predictions. One of its fundamental assumptions is the independence of prediction features. Several open source implementations are available, with features beyond those of the basic algorithm.

This feature originally appeared on opensource.com.

For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!
