Non-linear Real Estate Price Prediction (Ames Housing):

This notebook covers:

- Initial exploration of the Ames Housing dataset
- Missing-value handling and categorical encoding
- Feature engineering and justified transformations
- Multicollinearity detection and mitigation
- Pipeline-based modeling with:
- Linear Regression
- Support Vector Regression (SVR)
- Artificial Neural Network (MLPRegressor)
- 5 fold cross-validation and hyperparameter optimization
- Bias-variance diagnostics, residual analysis, and variable impact interpretation
- Statistical model comparison and training-time analysis

1) Setup and Reproducibility:

We fix random seeds globally and import all required libraries.

In [1]:

# standard python libs
import os
import random
import time
import warnings
from typing import Dict, List, Tuple # this one helps me for typing.

warnings.filterwarnings("ignore")

# importing machine learning stuff
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats

# SKLearn stuff
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (
    KFold,
    RandomizedSearchCV,
    GridSearchCV,
    cross_validate,
    learning_curve,
    train_test_split,
)
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.svm import SVR

Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)

In [2]:

sns.set_theme(style="whitegrid")

RANDOM_STATE = 67
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

print(f"[INFO] Random state fixed at {RANDOM_STATE}")

# 
SALEPRICE_PLOT_COLOR = "steelblue"
LOGSALEPRICE_PLOT_COLOR = "darkorange"

[INFO] Random state fixed at 67

2)Loading dataset:

Dataset source: House Prices - Advanced Regression Techniques

In [3]:

DATASET_PATH = "/kaggle/input/competitions/house-prices-advanced-regression-techniques/train.csv"
df = pd.read_csv(DATASET_PATH)
df.shape

Out[3]:

(1460, 81)

3) Initial Data Exploration

We inspect schema, missingness, and key target behavior.

In [4]:

display(df.head())
print(f"Shape: {df.shape}")
print(f"Columns: {len(df.columns)}\n")
print("Data types:")
display(df.dtypes.value_counts())

missing = df.isna().mean().sort_values(ascending=False)
missing_top = missing[missing > 0].head(20)

fig, axes = plt.subplots(1, 2, figsize=(16, 5))
sns.histplot(df['SalePrice'], kde=True, ax=axes[0], color='teal')
axes[0].set_title('SalePrice Distribution (Raw)')

if len(missing_top) > 0:
    sns.barplot(x=missing_top.values, y=missing_top.index, ax=axes[1], palette='viridis', hue=missing_top.index, legend=False)
    axes[1].set_title('Top Missingness Ratios')
    axes[1].set_xlabel('Missing Ratio')
else:
    axes[1].text(0.5, 0.5, 'No Missing Values', ha='center', va='center')
    axes[1].set_axis_off()

plt.tight_layout()
plt.show()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 81 columns

Shape: (1460, 81) Columns: 81 Data types:

object     43
int64      35
float64     3
Name: count, dtype: int64

4) Target Transformation Justification

House prices are usually unevenly distributed: most are average or low, but a few are very high (right-skewed), and the amount they vary changes depending on the price cheaper houses have similar prices, expensive houses vary a lot (heteroscedastic).

We model $\log(1 + SalePrice)$ to stabilize variance and improve regression behavior.

In [5]:

if "Id" in df.columns:
    df = df.drop(columns = ["Id"])
df["LogSalePrice"] = np.log1p(df["SalePrice"])

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(df["SalePrice"], kde=True, ax=axes[0], color=SALEPRICE_PLOT_COLOR)
axes[0].set_title("Raw SalePrice")
sns.histplot(df["LogSalePrice"], kde=True, ax=axes[1], color=LOGSALEPRICE_PLOT_COLOR)
axes[1].set_title("Log(1 + SalePrice)")
plt.tight_layout()
plt.show()

5) Feature Engineering:

These features were selected to capture structural and temporal characteristics of the properties:
* TotalSF: Sum of livable and basement area.
* HouseAgeAtSale: Age of the property at the time of sale.
* RemodAgeAtSale: Time elapsed since the most recent remodel at sale.
* TotalBaths: Overall bathroom capacity (weighted aggregation of full and half bathrooms).
* HasGarage: Whether the property has a garage or not (boolean value).
* HasBsmnt: Whether the property has a basement or not (boolean value).

These engineered features aim to enhance the model’s capacity to capture non linear relationships while maintaining interpretability within the domain context.

In [6]:

def engineerFeatures(inputDf: pd.DataFrame) -> pd.DataFrame:
    data = inputDf.copy()

    data["TotalSF"] = 0
    for col in ["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]:
        if col not in data.columns:
            data[col] = 0
        data["TotalSF"] += data[col].fillna(0)

    if {'YrSold', 'YearBuilt'}.issubset(data.columns):
        data['HouseAgeAtSale'] = data['YrSold'].fillna(data['YrSold'].median()) - data['YearBuilt'].fillna(data['YearBuilt'].median())

    if {'YrSold', 'YearRemodAdd'}.issubset(data.columns):
        data['RemodAgeAtSale'] = data['YrSold'].fillna(data['YrSold'].median()) - data['YearRemodAdd'].fillna(data['YearRemodAdd'].median())

    full_bath_cols = [c for c in ['FullBath', 'BsmtFullBath'] if c in data.columns]
    half_bath_cols = [c for c in ['HalfBath', 'BsmtHalfBath'] if c in data.columns]
    data['TotalBaths'] = data[full_bath_cols].fillna(0).sum(axis=1) + 0.5 * data[half_bath_cols].fillna(0).sum(axis=1)

    if 'GarageArea' in data.columns:
        data['HasGarage'] = (data['GarageArea'].fillna(0) > 0).astype(int)
    if 'TotalBsmtSF' in data.columns:
        data['HasBsmt'] = (data['TotalBsmtSF'].fillna(0) > 0).astype(int)

    return data

x = df.drop(columns=['SalePrice', 'LogSalePrice'])
y = df['LogSalePrice']

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=RANDOM_STATE)

print(f"Train shape: {xTrain.shape}, Test shape: {xTest.shape}")

Train shape: (1168, 79), Test shape: (292, 79)

6) Multicollinearity Detection and Mitigation

We estimate VIF (Variance Inflation Factor) on numeric engineered features and iteratively drop features with VIF > 10.

Rationale: High multicollinearity inflates variance of linear coefficients and can destabilize interpretation.

In [7]:

VIF_THRESHOLD = 10.0
MAX_DROPS = 15

def computeVIF(numericDF: pd.DataFrame) -> pd.DataFrame:
    work = numericDF.select_dtypes(include=[np.number]).copy()
    work = work.replace([np.inf, -np.inf], np.nan)

    # Median imputation per column; if a column is fully missing, fallback to 0.
    for col in work.columns:
        colMedian = work[col].median()
        if pd.isna(colMedian):
            colMedian = 0.0
        work[col] = work[col].fillna(colMedian)
        
    vals = []
    cols = list(work.columns)

    for col in cols:
        yCol = work[col]
        xCols = work.drop(columns=[col])
        if xCols.shape[1] == 0:
            vals.append((col, 1.0))
            continue

        model = LinearRegression()
        model.fit(xCols, yCol)
        r2 = model.score(xCols, yCol)
        vif = 1.0 / max(1e-8, (1.0 - r2))
        vals.append((col, float(vif)))

    return  pd.DataFrame(vals, columns=["feature", "vif"]).sort_values('vif', ascending=False) if vals else pd.DataFrame(vals, columns=["feature", "vif"])

XTrainFe = engineerFeatures(xTrain)
numericVIFCols = XTrainFe.select_dtypes(include=[np.number]).columns.tolist()
vifDF = XTrainFe[numericVIFCols].copy()

droppedMulticollinear: List[str] = []

for _ in range(MAX_DROPS):
    currentVIF = computeVIF(vifDF)
    if currentVIF.empty:
        break
        
    worst = currentVIF.iloc[0]
    if worst['vif'] <= VIF_THRESHOLD:
        break
    dropFeature = worst["feature"]
    droppedMulticollinear.append(dropFeature)
    vifDF = vifDF.drop(columns=[dropFeature], errors="ignore")

finalVIF = computeVIF(vifDF)
finalVIFHead = finalVIF.head(15)
print("Dropped due to multicollinearity:")
print(droppedMulticollinear if droppedMulticollinear else "None")
display(finalVIFHead)

plt.figure(figsize=(10, 6))
sns.barplot(data=finalVIFHead, x="vif", y="feature", palette='mako', hue="feature", legend=False)
plt.xlabel("VIF (Variance Inflation Factor)")
plt.ylabel("Feature")
plt.title("Top Remaining VIF Values (Plot Mitigation)")
plt.tight_layout()
plt.show()

Dropped due to multicollinearity: ['BsmtFinSF2', 'YearRemodAdd', 'YearBuilt', 'BsmtHalfBath', 'TotalSF', 'GrLivArea', 'TotalBaths', 'TotalBsmtSF']

	feature	vif
8	1stFlrSF	6.300462
19	GarageCars	6.245212
20	GarageArea	5.906320
9	2ndFlrSF	5.772305
30	HouseAgeAtSale	5.526510
6	BsmtFinSF1	5.047759
16	TotRmsAbvGrd	5.036058
7	BsmtUnfSF	4.470887
18	GarageYrBlt	4.313016
3	OverallQual	3.456460
12	FullBath	3.039315
31	RemodAgeAtSale	2.435597
14	BedroomAbvGr	2.343514
32	HasGarage	2.284594
13	HalfBath	2.160178

7) Preprocessing Pipelines

- Missing values are imputed (median for numeric, most frequent for categorical).
- Categorical features are one hot encoded.
- Standardization is applied for SVR and ANN (distance based and gradient based methods), but not required for plain linear regression.

In [8]:

def dropSelectedColumns(data: pd.DataFrame, targets: List[str]) -> pd.DataFrame:
    return data.drop(targets, errors="ignore")

xTrainPrepared = dropSelectedColumns(engineerFeatures(xTrain), droppedMulticollinear)

numFeatures = xTrainPrepared.select_dtypes(include=[np.number]).columns.tolist()
catFeatures = xTrainPrepared.select_dtypes(exclude=[np.number]).columns.tolist()

numericTransformLinear = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

numericTransformScaled = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categoricalTransform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessorLinear = ColumnTransformer(transformers=[
    ("num", numericTransformLinear, numFeatures),
    ("cat", categoricalTransform, catFeatures)
])

preprocessorScaled = ColumnTransformer(transformers=[
    ('num', numericTransformScaled, numFeatures),
    ('cat', categoricalTransform, catFeatures),
])

featureEngineeringStep = FunctionTransformer(engineerFeatures, validate=False)
dropMulticolStep = FunctionTransformer(lambda d: dropSelectedColumns(d, droppedMulticollinear), validate=False)

cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

print(f'Numeric features: {len(numFeatures)}')
print(f'Categorical features: {len(catFeatures)}')

Numeric features: 42 Categorical features: 43

8) Build Models and Hyperparameter Search

All models use Pipeline and are tuned with CV search.

In [9]:

def rmse(yTrue, yPred):
    return np.sqrt(mean_squared_error(yTrue, yPred))

pipeLR = Pipeline(steps=[
    ("feat", featureEngineeringStep),
    ("drop_multi", dropMulticolStep),
    ("prep", preprocessorLinear),
    ("model", LinearRegression())
])

pipeSVR = Pipeline(steps=[
    ("feat", featureEngineeringStep),
    ("drop_multi", dropMulticolStep),
    ("prep", preprocessorScaled),
    ("model", SVR())
])

pipeANN = Pipeline(steps=[
    ("feat", featureEngineeringStep),
    ("drop_multi", dropMulticolStep),
    ("prep", preprocessorLinear),
    ("model", MLPRegressor(
        random_state = RANDOM_STATE,
        max_iter = 2000,
        early_stopping = True,
        validation_fraction=0.1
    ))
])

gridLR = {
    'model__fit_intercept': [True, False],
    'model__positive': [False, True],
}

randSVR = {
    'model__kernel': ['rbf'],
    'model__C': np.logspace(-1, 2.5, 30),
    'model__epsilon': np.linspace(0.01, 0.5, 30),
    'model__gamma': ['scale', 'auto'],
}

randANN = {
    'model__hidden_layer_sizes': [(64,), (128,), (64, 32), (128, 64)],
    'model__activation': ['relu', 'tanh'],
    'model__alpha': np.logspace(-6, -2, 20),
    'model__learning_rate_init': np.logspace(-4, -2, 20),
}

searches = {
    "Linear Regression": GridSearchCV(
        estimator= pipeLR,
        param_grid= gridLR,
        scoring= "neg_root_mean_squared_error",
        cv= cv,
        n_jobs= 1,
        return_train_score= True
    ),
    "SVR": RandomizedSearchCV(
        estimator= pipeSVR,
        param_distributions= randSVR,
        n_iter= 25,
        random_state= RANDOM_STATE,
        scoring= "neg_root_mean_squared_error",
        cv= cv,
        n_jobs= 1,
        return_train_score= True
    ),
    "ANN": RandomizedSearchCV(
        estimator= pipeANN,
        param_distributions= randANN,
        n_iter= 25,
        random_state= RANDOM_STATE,
        scoring= "neg_root_mean_squared_error",
        cv= cv,
        n_jobs= 1,
        return_train_score= True
    )
}

bestModels = {}
searchSummaries = []

for modelName, search in searches.items():
    print(f"\nTraining {modelName}...")
    start= time.perf_counter()
    search.fit(xTrain, yTrain)
    elapsed= time.perf_counter() - start

    bestModels[modelName] = search.best_estimator_
    searchSummaries.append({
        "model": modelName,
        "bestCVRmse": -search.best_score_,
        "bestTimeSec": elapsed,
        "MeanFitTimeBest": search.cv_results_['mean_fit_time'][search.best_index_],
        "bestParams": search.best_params_
    })

summaryDF = pd.DataFrame(searchSummaries).sort_values("bestCVRmse")
display(summaryDF[["model", "bestCVRmse", "bestTimeSec", "MeanFitTimeBest"]])
summaryDF

Training Linear Regression... Training SVR... Training ANN...

	model	bestCVRmse	bestTimeSec	MeanFitTimeBest
1	SVR	0.136272	20.969226	0.102099
0	Linear Regression	0.182306	2.835429	0.061095
2	ANN	0.319546	377.929576	2.057789

	model	bestCVRmse	bestTimeSec	MeanFitTimeBest	bestParams
1	SVR	0.136272	20.969226	0.102099	{'model__kernel': 'rbf', 'model__gamma': 'auto...
0	Linear Regression	0.182306	2.835429	0.061095	{'model__fit_intercept': True, 'model__positiv...
2	ANN	0.319546	377.929576	2.057789	{'model__learning_rate_init': 0.00233572146909...

8) Holdout Evaluation + Bias-Variance Diagnostics

We evaluate on unseen test data and compare train vs validation errors via cross validation to diagnose underfitting/overfitting.

In [10]:

evaluationRows = []
foldScores = {}

for model_name, model in bestModels.items():
    fitStart = time.perf_counter()
    model.fit(xTrain, yTrain)
    fitTime = time.perf_counter() - fitStart

    predStart = time.perf_counter()
    predTest = model.predict(xTest)
    predTime = time.perf_counter() - predStart

    cvResult = cross_validate(
        model,
        xTrain,
        yTrain,
        cv=cv,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
        n_jobs=1
    )

    trainRmse = -cvResult["train_score"]
    valRmse = -cvResult["test_score"]
    foldScores[model_name] = valRmse

    evaluationRows.append({
        'model': model_name,
        'testRmse': rmse(yTest, predTest),
        'testMae': mean_absolute_error(yTest, predTest),
        'testR2': r2_score(yTest, predTest),
        'cvTrainRmseMean': trainRmse.mean(),
        'cvValRmseMean': valRmse.mean(),
        'biasVarianceGap': valRmse.mean() - trainRmse.mean(),
        'fitTimeSec': fitTime,
        'predTimeSec': predTime,
    })

evalDF = pd.DataFrame(evaluationRows).sort_values("testRmse")
display(evalDF)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.barplot(data=evalDF, x="model", y="testRmse", ax=axes[0], palette="crest", hue="testRmse", legend=True)
axes[0].set_title("Test RMSE (lower is better)")
axes[0].set_xlabel("Model")
axes[0].set_ylabel("Test RMSE")
axes[0].tick_params(axis="x", rotation=15)

sns.barplot(data=evalDF, x="model", y="biasVarianceGap", ax=axes[1], palette="flare", hue="biasVarianceGap", legend=True)
axes[1].axhline(0, linestyle="--", color="black", linewidth=1)
axes[1].set_title("Bias-Variance Gap: CV Val RMSE - CV Train RMSE")
axes[1].set_xlabel("Model")
axes[1].set_ylabel("Bias-Variance Gap")
axes[1].tick_params(axis="x", rotation=15)

plt.tight_layout()
plt.show()

	model	testRmse	testMae	testR2	cvTrainRmseMean	cvValRmseMean	biasVarianceGap	fitTimeSec	predTimeSec
1	SVR	0.106811	0.081612	0.918196	0.104451	0.136272	0.031822	0.195910	0.025118
0	Linear Regression	0.133803	0.091687	0.871627	0.089362	0.182306	0.092944	0.063651	0.022782
2	ANN	0.268396	0.199568	0.483471	0.303049	0.319546	0.016497	1.612850	0.034098

9) Statistical Model Comparison

Using paired $t$-tests on fold-wise CV RMSE values (same folds for all models).

In [11]:

modelNames = list(foldScores.keys())
stateRows = []

for i in range(len(modelNames)):
    for j in range(i + 1, len(modelNames)):
        a = modelNames[i]
        b = modelNames[j]
        tStat, pVal = stats.ttest_rel(foldScores[a], foldScores[b])
        stateRows.append({
            "modelA": a,
            "modelB": b,
            "meanRmseA": np.mean(foldScores[a]),
            "meanRmseB": np.mean(foldScores[b]),
            "tStat": tStat,
            "pValue": pVal,
        })

statsDF = pd.DataFrame(stateRows).sort_values("pValue")
display(statsDF)

print('Interpretation tip: pValue < 0.05 suggests statistically significant fold-level performance difference.')

	modelA	modelB	meanRmseA	meanRmseB	tStat	pValue
2	SVR	ANN	0.136272	0.319546	-15.571130	0.000099
1	Linear Regression	ANN	0.182306	0.319546	-4.834344	0.008435
0	Linear Regression	SVR	0.182306	0.136272	2.314003	0.081672

Interpretation tip: pValue < 0.05 suggests statistically significant fold-level performance difference.

10) Learning Curves (Bias-Variance Visual)

Learning curves show whether additional data may reduce variance or bias for each model.

In [12]:

fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharey=True)
trainSizes = np.linspace(0.2, 1.0, 5)

for ax, (model_name, model) in zip(axes, bestModels.items()):
    sizes, trainScores, valScores = learning_curve(
        model,
        xTrain,
        yTrain,
        cv=cv,
        train_sizes=trainSizes,
        scoring="neg_root_mean_squared_error",
        n_jobs = 1
    )

    trainRmse = -trainScores.mean(axis=1)
    valRmse = -valScores.mean(axis=1)

    ax.plot(sizes, trainRmse, marker="o", label="Train RMSE")
    ax.plot(sizes, valRmse, marker="s", label="CV RMSE")
    ax.set_title(f"{model_name} learning curve")
    ax.set_xlabel("Training Samples")
    ax.grid(alpha=0.3)

axes[0].set_ylabel("RMSE")
axes[0].legend()
plt.tight_layout()
plt.show()

11) Variable Impact Interpretation

- Linear model: direct coefficient interpretation after preprocessing.
- SVR and ANN: permutation importance on test set.

In [13]:

def getFeatureNamesFromPreprocessor(preprocessor: ColumnTransformer) -> np.ndarray:
    return preprocessor.get_feature_names_out()

impactFrames = []

 # Linear coefs:
LRModel = bestModels["Linear Regression"]
LRPrep = LRModel.named_steps["prep"]
LREst = LRModel.named_steps["model"]
LRFeatureNames = getFeatureNamesFromPreprocessor(LRPrep)

coefDF = pd.DataFrame({
    "feature": LRFeatureNames,
    "impact": LREst.coef_
})

coefDF["absImpact"] = coefDF["impact"].abs()
coefTop = coefDF.sort_values("absImpact", ascending=False).head(20)
coefTop["model"] = "Linear Regression (coefficient)"
impactFrames.append(coefTop[["model", "feature", "impact", "absImpact"]])

# Permutation importance for non linear models
for modelName in ["SVR", "ANN"]:
    mdl = bestModels[modelName]
    pi = permutation_importance(
        mdl,
        xTest,
        yTest,
        n_repeats=10,
        random_state=RANDOM_STATE,
        scoring="neg_root_mean_squared_error",
        n_jobs=1
    )

    piDF = pd.DataFrame({
        "feature": xTest.columns,
        "impact": pi.importances_mean
    })
    piDF["absImpact"] = piDF["impact"].abs()
    piTop = piDF.sort_values("absImpact", ascending=False).head(20)
    piTop["model"] = f"{modelName} (permutation)"
    impactFrames.append(piTop[["model", "feature", "impact", "absImpact"]])

impactDF = pd.concat(impactFrames, ignore_index=True)
display(impactDF.head(50))

fig, axes = plt.subplots(1, 3, figsize=(18, 8))
for ax, mdl in zip(axes, impactDF["model"].unique()):
    sub = impactDF[impactDF["model"] == mdl].sort_values("absImpact", ascending=True).tail(15)
    sns.barplot(data=sub, x="absImpact", y="feature", ax=ax, palette="cubehelix", hue="feature", legend=False)
    ax.set_xlabel("Absolute Impact")
    ax.set_ylabel("Feature")
    ax.set_title(mdl)
plt.tight_layout()
plt.show()

	model	feature	impact	absImpact
0	Linear Regression (coefficient)	cat__RoofMatl_ClyTile	-2.258783	2.258783
1	Linear Regression (coefficient)	cat__RoofMatl_Membran	0.812536	0.812536
2	Linear Regression (coefficient)	cat__RoofMatl_Metal	0.640716	0.640716
3	Linear Regression (coefficient)	cat__Condition2_PosN	-0.617063	0.617063
4	Linear Regression (coefficient)	cat__Utilities_AllPub	0.589445	0.589445
5	Linear Regression (coefficient)	cat__GarageQual_Ex	0.499449	0.499449
6	Linear Regression (coefficient)	cat__BsmtCond_Po	0.496187	0.496187
7	Linear Regression (coefficient)	cat__Condition2_PosA	0.494142	0.494142
8	Linear Regression (coefficient)	cat__CentralAir_Y	0.486507	0.486507
9	Linear Regression (coefficient)	cat__Street_Pave	0.478744	0.478744
10	Linear Regression (coefficient)	cat__Alley_Pave	0.465652	0.465652
11	Linear Regression (coefficient)	cat__Alley_Grvl	0.436322	0.436322
12	Linear Regression (coefficient)	cat__Street_Grvl	0.423230	0.423230
13	Linear Regression (coefficient)	cat__CentralAir_N	0.415467	0.415467
14	Linear Regression (coefficient)	cat__LandSlope_Mod	0.395719	0.395719
15	Linear Regression (coefficient)	cat__Condition2_Feedr	0.393939	0.393939
16	Linear Regression (coefficient)	cat__GarageCond_Po	0.384189	0.384189
17	Linear Regression (coefficient)	cat__RoofMatl_WdShngl	0.382020	0.382020
18	Linear Regression (coefficient)	cat__RoofMatl_Roll	0.379270	0.379270
19	Linear Regression (coefficient)	cat__MiscFeature_Othr	0.362884	0.362884
20	SVR (permutation)	2ndFlrSF	0.031193	0.031193
21	SVR (permutation)	OverallQual	0.029872	0.029872
22	SVR (permutation)	OverallCond	0.021233	0.021233
23	SVR (permutation)	GrLivArea	0.019280	0.019280
24	SVR (permutation)	TotalBsmtSF	0.015383	0.015383
25	SVR (permutation)	YearBuilt	0.013890	0.013890
26	SVR (permutation)	LotArea	0.013328	0.013328
27	SVR (permutation)	1stFlrSF	0.012957	0.012957
28	SVR (permutation)	BsmtFinSF1	0.006109	0.006109
29	SVR (permutation)	GarageArea	0.005968	0.005968
30	SVR (permutation)	Fireplaces	0.005531	0.005531
31	SVR (permutation)	TotRmsAbvGrd	0.003590	0.003590
32	SVR (permutation)	Neighborhood	0.003291	0.003291
33	SVR (permutation)	MSZoning	0.003248	0.003248
34	SVR (permutation)	MiscVal	0.003209	0.003209
35	SVR (permutation)	YearRemodAdd	0.002899	0.002899
36	SVR (permutation)	FullBath	0.002764	0.002764
37	SVR (permutation)	GarageCars	0.002282	0.002282
38	SVR (permutation)	HalfBath	0.002265	0.002265
39	SVR (permutation)	PoolArea	0.001972	0.001972
40	ANN (permutation)	2ndFlrSF	0.038569	0.038569
41	ANN (permutation)	LotArea	0.034728	0.034728
42	ANN (permutation)	TotalBsmtSF	0.025261	0.025261
43	ANN (permutation)	BsmtFinSF1	0.013872	0.013872
44	ANN (permutation)	GrLivArea	0.012307	0.012307
45	ANN (permutation)	GarageArea	0.005593	0.005593
46	ANN (permutation)	MasVnrArea	0.003570	0.003570
47	ANN (permutation)	1stFlrSF	0.003274	0.003274
48	ANN (permutation)	YearBuilt	0.002667	0.002667
49	ANN (permutation)	BsmtUnfSF	0.001814	0.001814

12) Runtime and Model Comparison Dashboard

We summarize predictive quality and computational cost together.

In [14]:

compareDF = evalDF.merge(summaryDF[['model', 'bestTimeSec']], on='model', how='left')
display(compareDF.sort_values('testRmse'))

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.barplot(data=compareDF, x='model', y='testRmse', ax=axes[0], palette='Blues')
axes[0].set_title('Test RMSE')
axes[0].set_xlabel("Model")
axes[0].set_ylabel("Test RMSE")
axes[0].tick_params(axis='x', rotation=15)

sns.barplot(data=compareDF, x='model', y='testR2', ax=axes[1], palette='Greens')
axes[1].set_title('Test $R^2$')
axes[0].set_xlabel("Model")
axes[1].set_ylabel("Test R2")
axes[1].tick_params(axis='x', rotation=15)

sns.barplot(data=compareDF, x='model', y='bestTimeSec', ax=axes[2], palette='Reds')
axes[2].set_title('Hyperparameter Search Time (sec)')
axes[0].set_xlabel("Model")
axes[1].set_ylabel("Search Time")
axes[2].tick_params(axis='x', rotation=15)

plt.tight_layout()
plt.show()

	model	testRmse	testMae	testR2	cvTrainRmseMean	cvValRmseMean	biasVarianceGap	fitTimeSec	predTimeSec	bestTimeSec
0	SVR	0.106811	0.081612	0.918196	0.104451	0.136272	0.031822	0.195910	0.025118	20.969226
1	Linear Regression	0.133803	0.091687	0.871627	0.089362	0.182306	0.092944	0.063651	0.022782	2.835429
2	ANN	0.268396	0.199568	0.483471	0.303049	0.319546	0.016497	1.612850	0.034098	377.929576

Final Interpretation

Results Summary

* Best Overall Model: SVR with the strongest performance (RMSE = 0.1068, MAE = 0.0816, R^2 = 0.9182)
* Linear Regression is a competitive baseline with very close stats but still clearly under SVR
* ANN underperformed in this configuration (RMSE = 0.2684, R^2 = 0.4835) and did not generalize as well as the other models.

Bias-Variance Analysis

- SVR shows a moderate generalization gap (cvValRMSE - cvTrainRMSE = 0.0318), suggesting a good balance between fit and generalization.
- Linear Regression shows the largest gap (0.0929), indicating greater variance/overfitting risk relative to SVR.
- ANN has a small gap (0.0165) but high absolute error, which is consistent with underfitting (high bias) in the current search space.

Statistical Comparison

Paired fold-wise tests indicate:
- SVR vs ANN: significant improvement for SVR (p = 0.000099).
- Linear Regression vs ANN: significant improvement for Linear Regression (p = 0.008435).
- Linear Regression vs SVR: difference not significant at alpha = 0.05 (p = 0.081672), though SVR is better in all main holdout metrics.

Variable Impact Interpretation

- For SVR and ANN (permutation importance), the most influential features include: 2ndFlrSF, OverallQual, OverallCond, GrLivArea, TotalBsmtSF, LotArea, YearBuilt, and 1stFlrSF.
- For Linear Regression coefficients, large absolute effects appeared in several one-hot encoded categories (for example roof material and condition-related indicators), reflecting strong categorical contributions in the linear model.

Runtime and Practical Trade-offs

- Linear Regression is fastest to tune (bestTimeSec ~ 2.79s) and predict, making it a strong lightweight baseline.
- SVR delivers the best predictive quality with moderate tuning cost (bestTimeSec ~ 20.10s), representing the best accuracy-cost compromise.
- ANN is most computationally expensive (bestTimeSec ~ 366.38s) while also giving the weakest performance in this setup.

Final Choice

Given predictive performance, bias-variance behavior, and practical runtime, SVR is selected as the final model for non-linear real estate price prediction on this Ames dataset workflow.

Additional: Production-Ready Predictions

This is only for generating a CSV file to submit for the House Prices - Advanced Regression Techniques competition on Kaggle.

In [15]:

TEST_DATASET_PATH = "/kaggle/input/competitions/house-prices-advanced-regression-techniques/train.csv"
testDF = pd.read_csv(TEST_DATASET_PATH)
submissionIds = testDF["Id"].copy()
xProd = testDF.drop(columns=["Id"])

svrModel = bestModels["SVR"]
predLog = svrModel.predict(xProd)
predPrice = np.expm1(predLog)

submissionDF = pd.DataFrame({
    "Id": submissionIds,
    "SalePrice": predPrice
})

submissionDF.to_csv("/kaggle/working/submission_svr.csv", index=False)