Linear Regression Models Performance on NASA Airfoil Self-Noise Dataset

Kalpa D. Fernando
Nov 1, 2022

NASA performed a series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections in an anechoic wind tunnel, and a data set was obtained from those tests. It contains the results of numerous wind tunnel studies conducted on NACA 0012 airfoils of varying sizes and angles of attack.

In this article, I will discuss the steps toward building three regression models on the Airfoil Self-Noise dataset and see how we can improve their performance. I won't be posting a full step-by-step coding guide, since a link to the Google Colab notebook is provided at the end of the article.

Outline

  1. What is regression?
  2. What is linear regression?
  3. Overview of Dataset
  4. Pre-processing
  5. Perform Feature Engineering
  6. Building Three Linear Regression Models
  7. Evaluate the Models
  8. Adding Polynomial Features and re-evaluating the results
  9. Summary and Conclusion

What is Regression?

Regression is a statistical technique used in finance, investment, and other fields to identify the strength and nature of the relationship between a dependent variable (often represented by y) and one or more independent variables (X).

What is Linear Regression?

Relationships between at least one explanatory variable and an outcome variable are modeled using linear regression. These variables are known as the independent and dependent variables respectively. When there is a single independent variable, the method is referred to as simple linear regression.

Linear Regression Model

There are several types of linear regression models:

  • Simple linear regression: models using only one predictor
  • Multiple linear regression: models using multiple predictors
  • Multivariate linear regression: models for multiple response variables

My goal is to create three linear regression models that predict the target value, i.e. the scaled sound pressure level, using the given attribute values.

In this article, I will be testing scikit-learn's

  • Linear Regression
  • Lasso Regression
  • Ridge regression

Linear regression models

Dataset Overview

First, I'll load the dataset into a pandas DataFrame.

Dataset as a pandas DataFrame
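A minimal sketch of loading the data, assuming the raw airfoil_self_noise.dat file (tab-separated, with no header row) has been downloaded locally from the UCI Machine Learning Repository; the column names follow the naming used in this article.

import pandas as pd

# Tab-separated file with no header row; assign the column names used in this article
columns = ['f', 'alpha', 'c', 'U_infinity', 'delta', 'SSPL']
df = pd.read_csv('airfoil_self_noise.dat', sep='\t', names=columns)
df.head()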

The dataset contains the following attributes:

1. f - Frequency, in Hertz.

2. alpha - Angle of attack, in degrees.

3. c - Chord length, in meters.

4. U_infinity - Free-stream velocity, in meters per second.

5. delta - Suction side displacement thickness, in meters.

6. SSPL - Scaled sound pressure level, in decibels.

The target variable in this data set is the Scaled Sound Pressure Level (SSPL), in decibels.

You can access the dataset from the below link.

Pre-processing

Pre-processing is one of the most important steps in machine learning. It directly affects the model's accuracy, performance, and quality: cleaning your dataset before handing it over to your model produces better results.

The dataset has no missing values and no categorical text to encode, so the missing-value handling and encoding steps can be skipped. The remaining pre-processing steps are:

  1. Handling Outliers
  2. Feature Transformations
  3. Feature Scaling
  4. Feature Discretization

Handling Outliers

We only perform outlier handling on continuous variables, since they span a much wider range of values than discrete variables.

To handle outliers, we first need to check whether any exist. For this I'll use Tukey's rule, also known as the IQR rule, and draw box plots to visualize the outliers clearly.

We can clearly see that outliers exist in 'f', 'alpha', 'delta', and 'U_infinity' (among the features). We do not consider SSPL because we do not want to mutate the label we are trying to predict.

Let’s see how many outliers we have in each column.

Now, we will calculate the interquartile range of the data (IQR = Q3 - Q1).

Then, we will determine our outlier boundaries with IQR.

Q1, Q3, and IQR values for each column

We will get our lower limit with this calculation: Q1 - 1.5 * IQR.
We will get our upper limit with this calculation: Q3 + 1.5 * IQR.

Lower limits and upper limits for handling outliers for each column

Now we can handle outliers by checking if they are out of the above limits.

The dataset has only 1,503 rows, so removing outliers is not a good idea. Outliers exist only in 'f', 'alpha', and 'delta', so we can instead cap them at the upper limit (Q3 + 1.5 * IQR) and lower limit (Q1 - 1.5 * IQR) of each column.
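As a rough sketch, the IQR limits and the capping step could look like this, assuming a DataFrame df holding the dataset (the notebook's exact implementation may differ):

# Compute Tukey (IQR) limits and cap outliers at those limits
for col in ['f', 'alpha', 'delta']:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Replace values outside the limits with the nearest limit
    df[col] = df[col].clip(lower=lower, upper=upper)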

Now that I have replaced the outliers with their corresponding upper and lower bounds, let's see the effect of handling them by observing the box plots of 'f', 'alpha', and 'delta'.

Before and after handling outliers of ‘f’, ‘alpha’, and ‘delta’

Outlier handling is now complete; let's move on to feature transformations.

Feature Transformations

Transformation is a method by which we can improve the performance of our model. A feature transformation applies a mathematical formula to a feature and turns its values into a form that is more suitable for further analysis.

Let’s plot some Q-Q plots and histograms.

By observing the above figures we can see that:

  • 'f', 'alpha', and 'delta' are right-skewed.
  • 'c' and 'U_infinity' can be discretized.

Let's measure the skewness of each feature. Using skew from scipy.stats the following values can be observed.

The skewness of each feature

Let's see how these skewness values are interpreted:

Skewness = 0: the distribution is symmetric (normally distributed).
Skewness > 0: more weight in the right tail of the distribution (right-skewed).
Skewness < 0: more weight in the left tail of the distribution (left-skewed).

  • A skewness value greater than 1 or less than -1 indicates a highly skewed distribution.
  • A value between 0.5 and 1 or -0.5 and -1 is moderately skewed.
  • A value between -0.5 and 0.5 indicates that the distribution is fairly symmetrical.

All the features have positive skewness: 'f' and 'delta' are highly right-skewed and 'alpha' is moderately right-skewed, while 'c' and 'U_infinity' have fairly symmetrical distributions.

Therefore, we apply transformations only to the columns f, delta, and alpha.

In the general case, we use an exponential transformation for left-skewed data and a logarithmic or square-root transformation for right-skewed data.

Since all three columns ('f', 'alpha', and 'delta') are right-skewed, a square-root transformation can be used to reduce the right-skewness of their distributions.
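A minimal sketch of measuring the skewness and applying the square-root transformation with scipy and numpy, assuming the capped DataFrame df from the previous step:

import numpy as np
from scipy.stats import skew

skewed_cols = ['f', 'alpha', 'delta']
print(df[skewed_cols].apply(skew))    # skewness before the transformation

# Square-root transformation reduces right-skewness
df[skewed_cols] = np.sqrt(df[skewed_cols])
print(df[skewed_cols].apply(skew))    # skewness after the transformation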

The skewness of features before (left) and after (right) transformations

Visualizing the effects of transformation on histograms

Before (left) and after (right) applying square root transformation on f, alpha and delta

We can see that the skewness of our dataset has been reduced by transformation.

Now let's move on to the next step of pre-processing.

Feature Scaling

Feature scaling is a technique for standardizing the range of independent variables or features.

If the data points are far from each other, scaling brings them closer together; in simpler words, scaling generalizes the data points so that the distances between them are smaller.

The min-max scaler is the simplest approach: it rescales each feature to fit within the interval [0, 1]. The standard normalization formula is x_scaled = (x - min(x)) / (max(x) - min(x)), where max(x) and min(x) represent the maximum and minimum feature values, respectively.

MinMax Scaler equation

Here I will be using MinMaxScaler to scale the data. In the below diagram, we can clearly observe the changes to the dataset after scaling.

Plot before and after scaling the features
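A minimal sketch of scaling with scikit-learn's MinMaxScaler, assuming every feature column is scaled (the notebook may scale a slightly different set of columns):

from sklearn.preprocessing import MinMaxScaler

feature_cols = ['f', 'alpha', 'c', 'U_infinity', 'delta']
scaler = MinMaxScaler()    # rescales each feature to the [0, 1] interval
df[feature_cols] = scaler.fit_transform(df[feature_cols])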

Now that scaling is complete, let's move on to the last pre-processing step for this dataset.

Discretization

Note: I discretized before scaling, so that the discretization step does not push values back from the smaller, scaled range to a larger range.

Discretization is the process through which continuous variables, models, or functions can be transformed into discrete forms. This is accomplished by generating a collection of contiguous intervals (or bins) that span the range of our desired variable/model/function. Continuous data are measured whereas discrete data are counted.

Histogram of Features and Label

We can clearly observe that the bins of 'c' and 'U_infinity' are well separated, so we can assume that these features can be discretized.

To verify this observation, let's take a look at the value counts of 'c' and 'U_infinity'.

Having verified the observation, we can apply discretization to 'c' and 'U_infinity'.

I will be using DecisionTreeDiscretiser for this purpose.

Using DecisionTreeDiscretiser to discretize c and U_infinity
c and U_infinity after discretization
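As a sketch, DecisionTreeDiscretiser from the feature_engine library can be applied as follows; the cv and scoring arguments shown here are illustrative choices, not necessarily the ones used in the notebook:

from feature_engine.discretisation import DecisionTreeDiscretiser

X = df[['f', 'alpha', 'c', 'U_infinity', 'delta']]
y = df['SSPL']

# Replace 'c' and 'U_infinity' with the predictions of a small decision tree
disc = DecisionTreeDiscretiser(cv=3,
                               scoring='neg_mean_squared_error',
                               variables=['c', 'U_infinity'],
                               regression=True)
X = disc.fit_transform(X, y)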

The pre-processing step is complete, and the next step is to perform feature engineering.

Feature Engineering

Feature engineering is the manipulation (addition, deletion, combination, and mutation) of our data set to improve machine learning model training, leading to better performance and accuracy. Effective feature engineering is based on solid knowledge of the business problem and the available data sources.

Feature Engineering step

First, we should identify dependent and independent features using a heat map.

Heat map of the dataset
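A heat map like this can be produced with seaborn; a sketch, assuming df still holds all six columns:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between all columns, including the label SSPL
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()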

If the correlation between two features is close to 1 (or -1), there is a strong relationship between them and we can assume that the two features are dependent.

From the above heat map, we can rank the features by their correlation with the target variable SSPL as follows:

1. f — Frequency, in Hertz.

2. delta — Suction side displacement thickness, in meters.

3. c — Chord length, in meters.

4. alpha — Angle of attack, in degrees.

5. U_infinity — Free-stream velocity, in meters per second.

Absolute correlation values feature vs label

Principal Component Analysis

Principal component analysis, or PCA, is a dimensionality-reduction technique that reduces large data sets by transforming a large collection of variables into a smaller one that retains most of the information in the original set.

Visualization using PCA for the original dataset

Principal component 1 vs Principal component 2
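A sketch of how such a visualization can be produced, assuming X holds the pre-processed features and y the SSPL label:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the five features onto the first two principal components
pca_2d = PCA(n_components=2)
components = pca_2d.fit_transform(X)

plt.scatter(components[:, 0], components[:, 1], c=y, cmap='viridis', s=10)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.colorbar(label='SSPL')
plt.show()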

Using PCA, we can see that this dataset is non-linear, so we cannot expect good results from a linear model such as plain Linear Regression.

But is it possible to increase the accuracy? Let's find out in a later section.

Let's check out the explained variance ratio which represents the proportion of variance attributable to each of the specified components.

Explained Variance Ratio
Sum of the first 3 and 4 ratios
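A minimal sketch of obtaining these ratios (the exact numbers depend on the pre-processing above):

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA()    # keep all five components
pca_full.fit(X)

print(pca_full.explained_variance_ratio_)
print(np.cumsum(pca_full.explained_variance_ratio_))    # cumulative variance per number of components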

About 98% of the variance is captured by the first four components, and about 89% by the first three.

Therefore, we need only four components to retain about 98% of the variance; the remaining component accounts for only about 2%.

It is fine not to keep all the components. If we keep everything, the model could overfit and fail when it is used in the real world; on the other hand, if I reduce the number of components too far, too little variance is retained and the model can underfit.

Therefore, now I reduced my feature space dimensions from five to four.

pca = PCA(n_components = 4)

Since we have completed feature engineering, it's time to build our linear regression models.

Building Linear Regression models

In this section, I will build three linear regression models to predict the target (SSPL)

I created a function to evaluate a linear regression model by passing in the model instance and the training and testing data (see notebook section c: Predict the value of Y).
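As a minimal sketch, such a helper might look like the following; the notebook's actual implementation and the metrics it reports may differ:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def fit_predict_and_evaluate(model, X_train, X_test, y_train, y_test):
    # Fit on the training split and predict on the held-out test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    print('MSE :', mse)
    print('RMSE:', np.sqrt(mse))
    print('MAE :', mean_absolute_error(y_test, y_pred))
    print('R2  :', r2_score(y_test, y_pred))
    return model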

I. Linear Regression Model

In scikit-learn, plain linear regression is the simplest form, where the model is not penalized for its choice of weights. This means that, during the training phase, if the model determines that a specific feature is especially relevant, it may assign a large weight to it. This can occasionally result in overfitting on small datasets.

Let's fit the data and evaluate

from sklearn.linear_model import LinearRegression

l_regressor = LinearRegression()
fit_predict_and_evaluate(l_regressor, X_train, X_test, y_train, y_test)
Linear Regression Evaluation

II. Lasso Regression

Lasso regression is a modified version of linear regression in which the model is penalized for the sum of the absolute values of the weights. Thus, the absolute values of the weights are reduced, and many tend toward zero.

Finding the alpha value

I tried the values alpha = 0.001, 0.01, 0.1, 1, and 10.

Since alpha = 0.1 gave the best accuracy, I selected it as the alpha value for the Lasso model.
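One way to run this search is with GridSearchCV; a sketch, assuming the same X_train and y_train split used elsewhere, and not necessarily how the notebook does it:

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10]}
search = GridSearchCV(Lasso(), param_grid, cv=10, scoring='r2')
search.fit(X_train, y_train)
print(search.best_params_)    # best alpha found by cross-validation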

la_regressor = Lasso(alpha=0.1)
fit_predict_and_evaluate(la_regressor,X_train,X_test,y_train,y_test)
Lasso Regression Evaluations

III. Ridge Regression

Ridge regression goes one step further by penalizing the model for the sum of the squared weights. Thus, the weights not only tend to have smaller absolute values; extreme weights are punished more strongly, resulting in a more evenly distributed set of weights.

from sklearn.linear_model import Ridge

r_regressor = Ridge()
fit_predict_and_evaluate(r_regressor, X_train, X_test, y_train, y_test)
Ridge Regression Evaluation

Evaluating the models

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model:

  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)

Here we have used K-fold cross-validation: the dataset is split into K folds (10 in this case), and each fold is held out in turn to evaluate how well the model handles data it has not seen before.
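A sketch of how these cross-validated scores can be obtained with scikit-learn's cross_val_score, assuming X and y hold the full pre-processed features and label; scikit-learn returns negated error metrics, so the sign is flipped here:

from sklearn.model_selection import cross_val_score

for name, scoring in [('MSE', 'neg_mean_squared_error'),
                      ('RMSE', 'neg_root_mean_squared_error'),
                      ('MAE', 'neg_mean_absolute_error')]:
    scores = -cross_val_score(l_regressor, X, y, cv=10, scoring=scoring)
    print(name, scores.mean())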

Summary of three models’ performance

These evaluation scores are not satisfactory.

How can we improve the accuracy? By adding polynomial features.

Let's find out. First, we need to find the best degree of polynomial features for each model.
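A sketch of one way to search for the best degree, using a pipeline of PolynomialFeatures followed by the model (shown here with LinearRegression) and scoring each degree with cross-validation; the notebook may do this differently:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Try small polynomial degrees and keep the one with the best cross-validated R^2
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=10, scoring='r2').mean()
    print(degree, round(score, 3))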

For Linear Regression

Optimal degree for linear regression

Now let's evaluate the linear regression model with polynomial features

fit_predict_and_evaluate(model,X_train,X_test,y_train,y_test)
Linear Regression with polynomial features
Before and after adding polynomial features to linear regression

Compared to linear regression without polynomial features, we got much better results here.

For Lasso Regression

Optimal degree for lasso regression
Before and after adding polynomial features to lasso regression

For Ridge Regression

Optimal degree for ridge regression
Evaluation of ridge regression with polynomial features
Before and after adding polynomial features to ridge regression

Summary and Conclusion

Earlier, using PCA, I stated that this dataset is non-linear and that we cannot get good results with a linear model such as plain Linear Regression.

Summary of model performance before applying polynomial features

To get better results, I added polynomial features to all three models, and the best-performing model was linear regression with polynomial features, as shown in the previous section.

Summary of model performance after applying polynomial features

Without polynomial features, the best accuracy we got was about 44%, and all three models performed almost identically on the dataset.

When should one use Linear regression, Ridge regression, and Lasso regression?

Technically speaking, linear regression is a form of Ridge or Lasso regression with a negligible penalty term.

  • LASSO uses L1 penalty, i.e. (2, 0, 0, 0) has the same penalty as (1, 1, 0, 0). As a consequence, small coefficients tend to be shrunk to zero.
  • Ridge regression uses an L2 penalty, i.e. the coefficients (2, 0, 0, 0) have the same penalty as (1, 1, 1, 1). The consequence is that large coefficients are shrunk substantially, but in general none are shrunk to zero.

Ridge and LASSO are related but serve different purposes.

Ridge provides biased estimates in order to reduce the variance of the parameter estimates; this is a good solution for collinearity issues. LASSO selects a subset of variables while taking model complexity into account; this method is appropriate when there are too many variables to use partial least squares or principal component regression. If you have the resources to swiftly cross-validate either a Ridge or a Lasso regression over a search grid, it is almost always advisable to test a penalty term in order to control your bias-variance trade-off.

Use plain linear regression only if 0 is the best value for your lambda penalty term, or if you are doing it manually.

Ridge is slightly more rigorous than Lasso, and I’d have to explain the arithmetic to prove this position.

However, Lasso can achieve sparsity, which might be advantageous for feature-selection applications.

If the objective of the regression is feature selection (especially if this is part of an automated system), then I would recommend Lasso; otherwise, I would recommend Ridge.

For this dataset, we can clearly see that linear regression without polynomial features is not suitable.

Check out the Colab notebook, which contains a step-by-step guide to everything we have discussed in this article.

However, a few authors have demonstrated that the best performance on this dataset is achieved by a neural network, followed by a support vector regression model. Check out the repository below for more information about the best-performing models on this dataset.

Thank you for reading
