How I Learnt to Solve the Kaggle House Price Advanced Regression

This page tells the story of how I approached and solved the famous Kaggle competition on house price prediction. It includes the lessons I learned, a line-by-line walkthrough of the code, and helpful context for beginners looking to do the same.

Step 1: Installing My Custom Cleaner

I built my own data cleaner and turned it into a Python package so I could reuse it easily:

pip install git+https://github.com/testgithubprecious/house_price_cleaner.git

The cleaner handles:

Separation of numerical and categorical features
Comprehensive data cleaning and transformation
Robust feature engineering for better model readiness
Returning a clean, model-ready DataFrame that can be passed directly to an ML model

You can use the class yourself by installing the package with the command above.

Step 2: Full Code Walkthrough with Explanation

🔗 Dataset used: House Prices - Advanced Regression Techniques (Kaggle)

Import Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from house_price_cleaner.cleaner import HousePricePreprocessor

Together these libraries cover data loading, cleaning, splitting, model training, and evaluation. My custom class HousePricePreprocessor is imported from the package installed in Step 1.

Load the Dataset

df = pd.read_csv("train.csv")
y = df["SalePrice"]
X = df.drop(columns=["SalePrice"])

We separate the target variable SalePrice into y, and keep the rest of the features in X.

Define Ordinal Mappings

ordinal_mappings = {
  'ExterQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'ExterCond': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'BsmtQual': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'BsmtCond': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'HeatingQC': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'KitchenQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'FireplaceQu': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'GarageQual': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'GarageCond': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
  'PoolQC': {'NA': 0, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
}
ordinal_features = list(ordinal_mappings.keys())

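I wrote the ordinal mappings by hand, following the category order given in the competition's data description file. They convert quality features like ExterQual or BsmtQual into numbers where "Po" (Poor) is low and "Ex" (Excellent) is high. Where 'NA' appears, it is mapped to 0 and marks a house that lacks the feature altogether, such as one with no basement or no garage.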
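One detail worth checking when applying mappings like these: by default, pandas read_csv parses the literal string "NA" as a missing value, so the 'NA': 0 entries only take effect if the missing values are turned back into "NA" first. Whether HousePricePreprocessor does this internally is not shown here, so the following is only a minimal sketch of the idea on a tiny made-up CSV:

```python
import io
import pandas as pd

# Tiny made-up stand-in for train.csv; the real file has many more columns.
csv_text = "Id,BsmtQual\n1,Gd\n2,NA\n3,TA\n"

# With default settings, read_csv turns the string "NA" into NaN.
df = pd.read_csv(io.StringIO(csv_text))
print(df["BsmtQual"].isna().sum(), "value(s) parsed as missing")

bsmt_map = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Restore the "NA" category before mapping so it becomes 0 instead of NaN.
encoded = df["BsmtQual"].fillna("NA").map(bsmt_map).astype(int)
print(encoded.tolist())
```

An alternative is to pass keep_default_na=False to read_csv, but that also stops genuinely missing numeric cells from being read as NaN, so filling per column is usually the gentler option.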

Preprocess the Data

preprocessor = HousePricePreprocessor(ordinal_mappings, ordinal_features)
X_processed = preprocessor.fit_transform(X, y)

This applies the full cleaning logic to the input data and returns a model-ready dataset.
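A caveat on ordering: calling fit_transform on every row before splitting means that anything the preprocessor learns from the data (medians, category encodings, and so on) has also seen the validation rows. Whether this matters for HousePricePreprocessor depends on its internals, which are not covered on this page. A sketch of the leakage-free ordering, using scikit-learn's SimpleImputer purely as a stand-in for the custom class and synthetic data in place of train.csv:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one numeric feature with some missing values.
rng = np.random.default_rng(0)
X = pd.DataFrame({"LotFrontage": rng.normal(70, 20, size=100)})
X.iloc[::10, 0] = np.nan
y = pd.Series(rng.normal(180_000, 50_000, size=100))

# Split first, then fit the preprocessing step on the training rows only.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

imputer = SimpleImputer(strategy="median")      # stand-in for HousePricePreprocessor
X_train_clean = imputer.fit_transform(X_train)  # learns the median from training rows
X_val_clean = imputer.transform(X_val)          # reuses that median, no refit
```

The same pattern carries over to any object with fit_transform and transform methods: fit on the training split, then only transform the validation and test data.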

Split into Train/Validation Sets

X_train, X_val, y_train, y_val = train_test_split(X_processed, y, test_size=0.2, random_state=42)

We split the data into 80% training and 20% validation. Fixing random_state=42 makes the split reproducible across runs.

Train the XGBoost Model

model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

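XGBoost is powerful and easy to tune. Here we use 100 trees (n_estimators=100) and a learning rate of 0.1 as a moderate starting point; a smaller learning rate generally needs more trees to reach the same fit.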
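To get a feel for whether 100 trees is enough at a given learning rate, it helps to look at the validation error after each added tree. The sketch below swaps in scikit-learn's GradientBoostingRegressor (not XGBoost) because its staged_predict method makes this easy to show, and it runs on synthetic data, so the numbers say nothing about the house price dataset itself:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the processed house data.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, random_state=0
)
model.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_rmse = [
    np.sqrt(mean_squared_error(y_val, pred))
    for pred in model.staged_predict(X_val)
]
best_n_trees = int(np.argmin(val_rmse)) + 1
print("Trees with lowest validation RMSE:", best_n_trees)
```

If the lowest error sits at the very last tree, the model would probably benefit from more trees or a higher learning rate; if it sits much earlier, the later trees are not helping on unseen data.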

Make Predictions and Evaluate

y_pred = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
r2 = r2_score(y_val, y_pred)

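RMSE (Root Mean Squared Error) is the square root of the average squared prediction error. It is expressed in the same units as SalePrice (dollars) and penalizes large misses more heavily than small ones.
R² Score shows the fraction of the variance in sale price that the model explains; 1.0 would be a perfect fit.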
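Note that the Kaggle leaderboard for this competition scores submissions on the RMSE between the logarithm of the predicted price and the logarithm of the actual price, so the dollar RMSE above is not directly comparable to leaderboard scores. A small sketch of both metrics on made-up prices (the helper names rmse and rmse_log are my own, not part of any library):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error in the units of the inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_log(y_true, y_pred):
    """RMSE computed on log prices; inputs must be strictly positive."""
    return rmse(np.log(y_true), np.log(y_pred))

# Made-up prices for illustration only.
y_true = [100_000, 200_000, 400_000]
y_pred = [110_000, 190_000, 400_000]

print("RMSE in dollars:", rmse(y_true, y_pred))
print("RMSE on log prices:", rmse_log(y_true, y_pred))
```

On the log scale, a $10,000 miss on a $100,000 house counts for more than the same miss on a $200,000 house, which is why the metric treats cheap and expensive homes more evenly.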

Print the Results

print("Validation RMSE:", rmse)
print("Validation R² Score:", r2)

Outputting results helps you track performance as you experiment and tweak parameters.

Predict on the Test CSV

test_df = pd.read_csv("test.csv")
X_test_processed = preprocessor.transform(test_df)
test_predictions = model.predict(X_test_processed)

submission = pd.DataFrame({
    "Id": test_df["Id"],
    "SalePrice": test_predictions
})
submission.to_csv("submission.csv", index=False)

After validating the model, you can use it on the actual test set provided by Kaggle. This generates predictions and saves them in the format required for submission.
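Before uploading, it can save a wasted submission to run a few sanity checks on the file. The function below is a hypothetical helper of my own naming (check_submission), shown on toy data rather than the real test.csv:

```python
import numpy as np
import pandas as pd

def check_submission(submission, test_ids):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems
    if len(submission) != len(test_ids):
        problems.append("row count differs from the test set")
    elif not (submission["Id"].to_numpy() == np.asarray(test_ids)).all():
        problems.append("Id values differ from the test set")
    if submission["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    if (submission["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems

# Toy data for illustration only.
test_ids = [1461, 1462, 1463]
good = pd.DataFrame({"Id": test_ids, "SalePrice": [120000.0, 155000.0, 180000.0]})
bad = pd.DataFrame({"Id": test_ids, "SalePrice": [120000.0, np.nan, -5.0]})

print(check_submission(good, test_ids))
print(check_submission(bad, test_ids))
```

With the variables from the walkthrough, the call would look like check_submission(submission, test_df["Id"]).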

What I Learnt

Preprocessing is just as important as modeling
Turning repetitive steps into packages makes projects faster
XGBoost is great for beginners if you understand the basics

If you're interested in learning how I built the HousePricePreprocessor class and why each step matters, I wrote a beginner-friendly guide that walks through it in detail, from scratch to deployment.

🔗 Check it out on Selar: https://selar.com/2y74rw3227?currency=USD