This page tells the story of how I approached and solved the famous Kaggle competition on house price prediction. It includes the lessons I learned, a line-by-line walkthrough of the code, and helpful context for beginners looking to do the same.
I built my own data cleaner and turned it into a Python package so I could reuse it easily:
pip install git+https://github.com/testgithubprecious/house_price_cleaner.git
The cleaner handles:
🔗 Dataset used: House Prices - Advanced Regression Techniques (Kaggle)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from house_price_cleaner.cleaner import HousePricePreprocessor
These libraries cover the whole pipeline: loading the data, cleaning it, splitting it, training the model, and evaluating it.
My custom class HousePricePreprocessor is imported from the installed package.
df = pd.read_csv("train.csv")
y = df["SalePrice"]
X = df.drop(columns=["SalePrice"])
We separate the target variable SalePrice into y, and keep the rest of the features in X.
ordinal_mappings = {
    'ExterQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'ExterCond': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'BsmtQual': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'BsmtCond': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'HeatingQC': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'KitchenQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'FireplaceQu': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'GarageQual': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'GarageCond': {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'PoolQC': {'NA': 0, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
}
ordinal_features = list(ordinal_mappings.keys())
I wrote the ordinal mappings by hand, following the category definitions in the competition's data description file.
These mappings convert quality features like ExterQual or BsmtQual into integers where "Poor" is low and "Excellent" is high, with 0 standing for a feature the house does not have (for example, no basement).
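To make the idea concrete, here is a minimal sketch of how one of these mappings could be applied with plain pandas. It is not necessarily what HousePricePreprocessor does internally; the helper name `encode_ordinal` and the toy data are assumptions made for illustration. One detail worth knowing: `pd.read_csv` treats the string `NA` as a missing value by default, so the `'NA': 0` entries only match if the file is loaded with `keep_default_na=False` or the missing values are handled separately. The sketch takes the second route and sends NaN to 0.

```python
import numpy as np
import pandas as pd

def encode_ordinal(series, mapping, missing_value=0):
    """Map quality labels to integers; NaN and unknown labels become missing_value."""
    # Series.map returns NaN for anything not found in the mapping
    return series.map(mapping).fillna(missing_value).astype(int)

bsmt_qual_map = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Toy column: the NaN mimics how read_csv loads an "NA" cell by default
toy = pd.Series(['Gd', 'TA', np.nan, 'Ex'], name='BsmtQual')

encoded = encode_ordinal(toy, bsmt_qual_map)
print(encoded.tolist())  # [4, 3, 0, 5]
```

Sending missing values to 0 only makes sense for columns where the data description says a missing entry means "feature not present"; for other columns a different fill strategy may be more appropriate.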
preprocessor = HousePricePreprocessor(ordinal_mappings, ordinal_features)
X_processed = preprocessor.fit_transform(X, y)
This applies the full cleaning logic on the input data and returns a model-ready dataset.
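The preprocessor is fitted once and reused later because anything learned from the training data, such as fill values for missing entries, has to be applied unchanged to new data. Below is a toy sketch of that fit/transform pattern. `MedianFiller` is a hypothetical class written for this illustration and does not reflect the actual internals of HousePricePreprocessor.

```python
import numpy as np
import pandas as pd

class MedianFiller:
    """Toy transformer: learns column medians in fit, reuses them in transform."""

    def fit(self, X, y=None):
        # Medians come only from the data passed to fit
        self.medians_ = X.median(numeric_only=True)
        return self

    def transform(self, X):
        # Fill missing numeric values with the medians learned in fit
        return X.fillna(self.medians_)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 65.0]})

filler = MedianFiller()
train_filled = filler.fit_transform(train)
test_filled = filler.transform(test)  # uses the training median (70.0), not test data

print(test_filled["LotFrontage"].tolist())  # [70.0, 65.0]
```

Calling `transform` rather than `fit_transform` on new data keeps the learned statistics tied to the training set.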
X_train, X_val, y_train, y_val = train_test_split(X_processed, y, test_size=0.2, random_state=42)
We split the data into 80% training and 20% validation. Using random_state=42 keeps it reproducible.
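As a quick check of what the split does, the sketch below runs `train_test_split` on a small synthetic frame (the data is made up for illustration) and confirms the 80/20 sizes and that the same `random_state` selects the same rows each time.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features and target
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series(np.arange(100) * 1000, name="SalePrice")

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_va2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(len(X_tr), len(X_va))            # 80 20
print(X_va.index.equals(X_va2.index))  # True
```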
model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
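XGBRegressor is a gradient-boosted tree model that works well on tabular data and has only a few key settings to tune. Here we use 100 trees and a learning rate of 0.1, a common starting point that balances training time against accuracy.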
y_pred = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
r2 = r2_score(y_val, y_pred)
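RMSE (Root Mean Squared Error) gives the typical size of the prediction error in the same units as the price. Because errors are squared before averaging, large misses count more heavily than small ones.
R² Score shows the fraction of the variation in sale price that the model explains, with 1.0 being a perfect fit.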
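Both metrics are short formulas, and writing them out with NumPy shows what they measure. The numbers below are made up for illustration. The sketch also computes RMSE on log prices, since this Kaggle competition scores submissions on the RMSE between the logarithm of the predicted and actual prices, so it is worth tracking next to the dollar-scale RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def log_rmse(y_true, y_pred):
    """RMSE on log prices; requires strictly positive values."""
    return rmse(np.log(y_true), np.log(y_pred))

# Made-up prices and predictions
y_true = np.array([100000.0, 200000.0, 300000.0])
y_hat = np.array([110000.0, 190000.0, 300000.0])

print("RMSE:", rmse(y_true, y_hat))          # roughly 8165 dollars
print("R2:", r2(y_true, y_hat))              # roughly 0.99
print("log RMSE:", log_rmse(y_true, y_hat))
```

On log prices, a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one, which is why the two RMSE values can rank models differently.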
print("Validation RMSE:", rmse)
print("Validation R² Score:", r2)
Outputting results helps you track performance as you experiment and tweak parameters.
test_df = pd.read_csv("test.csv")
X_test_processed = preprocessor.transform(test_df)
test_predictions = model.predict(X_test_processed)
submission = pd.DataFrame({
    "Id": test_df["Id"],
    "SalePrice": test_predictions
})
submission.to_csv("submission.csv", index=False)
After validating the model, you can use it on the actual test set provided by Kaggle. This generates predictions and saves them in the format required for submission.
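Before uploading, a few cheap checks can catch a malformed file. The helper below is a hypothetical sketch (`check_submission` is not part of any library) that verifies the two expected columns, one row per test Id, and no missing or non-positive prices. It is shown here on a toy frame.

```python
import pandas as pd

def check_submission(submission, test_ids):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems  # the remaining checks need both columns
    if len(submission) != len(test_ids) or set(submission["Id"]) != set(test_ids):
        problems.append("Ids do not match the test set")
    if submission["SalePrice"].isna().any():
        problems.append("missing predictions")
    if (submission["SalePrice"] <= 0).any():
        problems.append("non-positive predictions")
    return problems

# Toy data for illustration
toy_ids = pd.Series([1461, 1462, 1463])
good = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, 150000.0, 180000.0]})
bad = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, None, -5.0]})

print(check_submission(good, toy_ids))  # []
print(check_submission(bad, toy_ids))   # ['missing predictions', 'non-positive predictions']
```

With the variables from this walkthrough, the call would be `check_submission(submission, test_df["Id"])`.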
If you're interested in learning how I built the HousePricePreprocessor class and why each step matters,
I wrote a beginner-friendly guide that walks through it in detail, from scratch to deployment.
🔗 Check it out on Selar: https://selar.com/2y74rw3227?currency=USD