Kaggle Playground Series S4E5 - Regression with a Flood Prediction Dataset

1. Introduction
In this notebook we work on Kaggle Playground Series S4E5, a flood prediction challenge where the goal is to predict the probability of a region flooding based on various factors.
2. EDA
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import ElasticNet
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import VotingRegressor, StackingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
import warnings
# Import datasets
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
id | MonsoonIntensity | TopographyDrainage | RiverManagement | Deforestation | Urbanization | ClimateChange | DamsQuality | Siltation | AgriculturalPractices | ... | DrainageSystems | CoastalVulnerability | Landslides | Watersheds | DeterioratingInfrastructure | PopulationScore | WetlandLoss | InadequatePlanning | PoliticalFactors | FloodProbability | |
0 | 0 | 5 | 8 | 5 | 8 | 6 | 4 | 4 | 3 | 3 | ... | 5 | 3 | 3 | 5 | 4 | 7 | 5 | 7 | 3 | 0.445 |
1 | 1 | 6 | 7 | 4 | 4 | 8 | 8 | 3 | 5 | 4 | ... | 7 | 2 | 0 | 3 | 5 | 3 | 3 | 4 | 3 | 0.450 |
2 | 2 | 6 | 5 | 6 | 7 | 3 | 7 | 1 | 5 | 4 | ... | 7 | 3 | 7 | 5 | 6 | 8 | 2 | 3 | 3 | 0.530 |
3 | 3 | 3 | 4 | 6 | 5 | 4 | 8 | 4 | 7 | 6 | ... | 2 | 4 | 7 | 4 | 4 | 6 | 5 | 7 | 5 | 0.535 |
4 | 4 | 5 | 3 | 2 | 6 | 4 | 4 | 3 | 3 | 3 | ... | 2 | 2 | 6 | 6 | 4 | 1 | 2 | 3 | 5 | 0.415 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1117952 | 1117952 | 3 | 3 | 4 | 10 | 4 | 5 | 5 | 7 | 10 | ... | 7 | 8 | 7 | 2 | 2 | 1 | 4 | 6 | 4 | 0.495 |
1117953 | 1117953 | 2 | 2 | 4 | 3 | 9 | 5 | 8 | 1 | 3 | ... | 9 | 4 | 4 | 3 | 7 | 4 | 9 | 4 | 5 | 0.480 |
1117954 | 1117954 | 7 | 3 | 9 | 4 | 6 | 5 | 9 | 1 | 3 | ... | 5 | 5 | 5 | 5 | 6 | 5 | 5 | 2 | 4 | 0.485 |
1117955 | 1117955 | 7 | 3 | 3 | 7 | 5 | 2 | 3 | 4 | 6 | ... | 6 | 8 | 5 | 3 | 4 | 6 | 7 | 6 | 4 | 0.495 |
1117956 | 1117956 | 4 | 5 | 6 | 9 | 5 | 5 | 2 | 8 | 4 | ... | 4 | 8 | 6 | 5 | 5 | 6 | 7 | 7 | 8 | 0.560 |
1117957 rows × 22 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1117957 entries, 0 to 1117956
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1117957 non-null int64
1 MonsoonIntensity 1117957 non-null int64
2 TopographyDrainage 1117957 non-null int64
3 RiverManagement 1117957 non-null int64
4 Deforestation 1117957 non-null int64
5 Urbanization 1117957 non-null int64
6 ClimateChange 1117957 non-null int64
7 DamsQuality 1117957 non-null int64
8 Siltation 1117957 non-null int64
9 AgriculturalPractices 1117957 non-null int64
10 Encroachments 1117957 non-null int64
11 IneffectiveDisasterPreparedness 1117957 non-null int64
12 DrainageSystems 1117957 non-null int64
13 CoastalVulnerability 1117957 non-null int64
14 Landslides 1117957 non-null int64
15 Watersheds 1117957 non-null int64
16 DeterioratingInfrastructure 1117957 non-null int64
17 PopulationScore 1117957 non-null int64
18 WetlandLoss 1117957 non-null int64
19 InadequatePlanning 1117957 non-null int64
20 PoliticalFactors 1117957 non-null int64
21 FloodProbability 1117957 non-null float64
dtypes: float64(1), int64(21)
memory usage: 187.6 MB
We see that all columns have a numeric data type with no null values or duplicated entries. This suggests that the data is clean, with no risk of hidden strings that do not conform to the numeric nature of the provided features.
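The checks behind this statement can be sketched on a tiny stand-in frame. The `toy` DataFrame and its values are made up for illustration; on the real data the same calls would be applied to `df_train`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_train (values are made up).
toy = pd.DataFrame({
    "MonsoonIntensity": [5, 6, 6, 3],
    "Deforestation": [8, 4, 7, 5],
    "FloodProbability": [0.445, 0.450, 0.530, 0.535],
})

n_missing = int(toy.isna().sum().sum())       # total number of null cells
n_duplicates = int(toy.duplicated().sum())    # number of fully duplicated rows
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(n_missing, n_duplicates, non_numeric)
```

An empty `non_numeric` list means no column was read as `object`, which is what would happen if stray strings were mixed into a numeric column.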
count | mean | std | min | 25% | 50% | 75% | max | |
id | 1117957.0 | 558978.000000 | 322726.531782 | 0.000 | 279489.00 | 558978.000 | 838467.00 | 1117956.000 |
MonsoonIntensity | 1117957.0 | 4.921450 | 2.056387 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
TopographyDrainage | 1117957.0 | 4.926671 | 2.093879 | 0.000 | 3.00 | 5.000 | 6.00 | 18.000 |
RiverManagement | 1117957.0 | 4.955322 | 2.072186 | 0.000 | 4.00 | 5.000 | 6.00 | 16.000 |
Deforestation | 1117957.0 | 4.942240 | 2.051689 | 0.000 | 4.00 | 5.000 | 6.00 | 17.000 |
Urbanization | 1117957.0 | 4.942517 | 2.083391 | 0.000 | 3.00 | 5.000 | 6.00 | 17.000 |
ClimateChange | 1117957.0 | 4.934093 | 2.057742 | 0.000 | 3.00 | 5.000 | 6.00 | 17.000 |
DamsQuality | 1117957.0 | 4.955878 | 2.083063 | 0.000 | 4.00 | 5.000 | 6.00 | 16.000 |
Siltation | 1117957.0 | 4.927791 | 2.065992 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
AgriculturalPractices | 1117957.0 | 4.942619 | 2.068545 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
Encroachments | 1117957.0 | 4.949230 | 2.083324 | 0.000 | 4.00 | 5.000 | 6.00 | 18.000 |
IneffectiveDisasterPreparedness | 1117957.0 | 4.945239 | 2.078141 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
DrainageSystems | 1117957.0 | 4.946893 | 2.072333 | 0.000 | 4.00 | 5.000 | 6.00 | 17.000 |
CoastalVulnerability | 1117957.0 | 4.953999 | 2.088899 | 0.000 | 3.00 | 5.000 | 6.00 | 17.000 |
Landslides | 1117957.0 | 4.931376 | 2.078287 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
Watersheds | 1117957.0 | 4.929032 | 2.082395 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
DeterioratingInfrastructure | 1117957.0 | 4.925907 | 2.064813 | 0.000 | 3.00 | 5.000 | 6.00 | 17.000 |
PopulationScore | 1117957.0 | 4.927520 | 2.074176 | 0.000 | 3.00 | 5.000 | 6.00 | 18.000 |
WetlandLoss | 1117957.0 | 4.950859 | 2.068696 | 0.000 | 4.00 | 5.000 | 6.00 | 19.000 |
InadequatePlanning | 1117957.0 | 4.940587 | 2.081123 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
PoliticalFactors | 1117957.0 | 4.939004 | 2.090350 | 0.000 | 3.00 | 5.000 | 6.00 | 16.000 |
FloodProbability | 1117957.0 | 0.504480 | 0.051026 | 0.285 | 0.47 | 0.505 | 0.54 | 0.725 |
Immediately we see something curious about this dataset. The summary statistics suggest that all feature columns have very similar distributions, likely due to the synthetic nature of the dataset and how it was generated. Hence we will not attempt to apply any real-world, flood-specific domain knowledge to guide us through the challenge.
# Drop useless id column
df_train = df_train.drop('id', axis = 1)
<Axes: >
The above heatmap shows that all feature columns have insignificant pairwise linear correlations, while all feature columns have some linear correlation with the target column.
MonsoonIntensity 0.189098
TopographyDrainage 0.187635
RiverManagement 0.187131
Deforestation 0.184001
Urbanization 0.180861
ClimateChange 0.184761
DamsQuality 0.187996
Siltation 0.186789
AgriculturalPractices 0.183366
Encroachments 0.178841
IneffectiveDisasterPreparedness 0.183109
DrainageSystems 0.179305
CoastalVulnerability 0.177774
Landslides 0.185346
Watersheds 0.181907
DeterioratingInfrastructure 0.190007
PopulationScore 0.185890
WetlandLoss 0.183396
InadequatePlanning 0.180968
PoliticalFactors 0.182417
FloodProbability 1.000000
Name: FloodProbability, dtype: float64
Similar to what we saw in our summary statistics previously, even the pairwise correlations between each feature and the target are very similar in value.
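A minimal sketch of how such a correlation summary can be produced, using synthetic stand-in data rather than the competition files. The frame `toy`, the column names `f0`..`f4`, and the way the target is generated are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 independent integer features and a noisy linear target.
X = pd.DataFrame(rng.poisson(5, size=(5000, 5)), columns=[f"f{i}" for i in range(5)])
toy = X.assign(target=X.sum(axis=1) * 0.005 + rng.normal(0, 0.02, size=len(X)))

corr = toy.corr()

# Correlation of every feature with the target, strongest first
target_corr = corr["target"].drop("target").sort_values(ascending=False)

# Largest absolute correlation between two different features
feat_corr = corr.drop(index="target", columns="target")
off_diagonal = feat_corr.where(~np.eye(len(feat_corr), dtype=bool))
max_feature_pair = off_diagonal.abs().max().max()

print(target_corr)
print(max_feature_pair)
```

In this toy setup the features are nearly uncorrelated with each other while each has a similar positive correlation with the target, the same qualitative pattern described above.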
# Plotting all the distributions
fig, ax = plt.subplots(5, 4, figsize=(16,16))
for col, a in zip(df_train.columns[:-1], ax.reshape(-1)):
    sns.barplot(pd.DataFrame(df_train[col].value_counts()).reset_index(), x=col, y='count', ax = a, color='blue')
    sns.barplot(pd.DataFrame(df_test[col].value_counts()).reset_index(), x=col, y='count', ax = a, color='green')
After plotting the feature distributions for the training set (blue) and the test set (green), we can clearly see that all features have essentially the same distribution.
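Because the training and test sets differ in size, raw counts are not directly comparable. One option is to compare proportions instead. The sketch below uses synthetic Poisson-distributed columns as stand-ins for a single feature; the names `train_col` and `test_col` and the distribution itself are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-ins for one feature column in train and test (different sizes).
train_col = pd.Series(rng.poisson(5, size=20000), name="feature")
test_col = pd.Series(rng.poisson(5, size=15000), name="feature")

# Proportions rather than raw counts, aligned on the feature value
props = pd.concat(
    [train_col.value_counts(normalize=True), test_col.value_counts(normalize=True)],
    axis=1,
    keys=["train", "test"],
).fillna(0).sort_index()

# Largest difference in proportion for any single feature value
max_gap = (props["train"] - props["test"]).abs().max()

print(props.head())
print(max_gap)
```

A small `max_gap` indicates that the two samples put nearly the same share of rows on every value, which is a numeric counterpart to overlaying the bar charts.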
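pca = PCA()
# NOTE: the fit call is not shown in the source cell; fitting on the feature columns is an assumption
pca.fit(df_train.drop('FloodProbability', axis=1))
pca_df = pd.DataFrame({'Explained Variance':pca.explained_variance_ratio_*100, 'Cumulative Explained Variance':np.cumsum(pca.explained_variance_ratio_)*100})
pca_df['Principal Component'] = list(range(len(pca_df)))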
Explained Variance | Cumulative Explained Variance | Principal Component | |
0 | 5.154727 | 5.154727 | 0 |
1 | 5.151653 | 10.306380 | 1 |
2 | 5.131702 | 15.438082 | 2 |
3 | 5.107014 | 20.545097 | 3 |
4 | 5.102997 | 25.648093 | 4 |
fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(pca_df, x='Principal Component', y='Explained Variance', ax=ax, color='green')
sns.lineplot(pca_df, x='Principal Component', y='Cumulative Explained Variance', ax=ax, marker='o', color='blue')
_ = ax.bar_label(ax.containers[0],fmt='%.2f%%', fontsize=9)
for cev, feature in zip(pca_df['Cumulative Explained Variance'], pca_df['Principal Component']):
    ax.annotate(str(round(cev,2)) + '%', (feature-1, cev+2))
plt.title('PCA Explained Variance')
From PCA we see that all principal components contribute roughly equally to the explained variance (about 5% each), so no small subset of components captures most of the variance in the dataset.
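A flat explained-variance profile like this is what PCA produces when features are close to uncorrelated. The sketch below illustrates that on synthetic data: 20 independent Poisson features, which is an assumption and not a claim about how the competition data was generated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# 20 independent synthetic features (assumed stand-in for the real columns)
X = rng.poisson(5, size=(20000, 20)).astype(float)

# Standardize, then look at the share of variance carried by each component
X_scaled = StandardScaler().fit_transform(X)
ratios = PCA().fit(X_scaled).explained_variance_ratio_ * 100

print(ratios.round(2))
```

With 20 uncorrelated features each component ends up near 100 / 20 = 5%, so dimensionality reduction via PCA would discard about as much variance as the fraction of components it drops.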
plt.title('Distribution of Target Flood Probabilities')
Our target appears to be approximately normally distributed, centered near 0.5.
3. Feature Engineering
In this section we create new features that may help in tackling this challenge. We will create several new features by computing summary statistics across the feature columns of each sample.
def create_new_features(data, cols):
    """Add row-wise summary statistics over `cols` as new columns."""
    df = data.copy()
    df['sum'] = df[cols].sum(axis=1)
    df['mean'] = df[cols].mean(axis=1)
    df['median'] = df[cols].median(axis=1)
    df['max'] = df[cols].max(axis=1)
    df['min'] = df[cols].min(axis=1)
    df['std'] = df[cols].std(axis=1)
    # 'cov' is the coefficient of variation (std / mean), not a covariance
    df['cov'] = df['std']/df['mean']
    df['p25'] = df[cols].quantile(0.25, axis=1)
    df['p75'] = df[cols].quantile(0.75, axis=1)
    df['range'] = df['max'] - df['min']
    return df
df_train_new = create_new_features(df_train, df_train.columns[:-1])
df_train_new_only = df_train_new.drop(df_train.columns[:-1], axis = 1)
FloodProbability | sum | mean | median | max | min | std | cov | p25 | p75 | range | |
0 | 0.445 | 94 | 4.70 | 4.5 | 8 | 2 | 1.750188 | 0.372380 | 3.00 | 5.25 | 6 |
1 | 0.450 | 94 | 4.70 | 4.0 | 9 | 0 | 2.296450 | 0.488606 | 3.00 | 6.25 | 9 |
2 | 0.530 | 99 | 4.95 | 5.0 | 8 | 1 | 1.932411 | 0.390386 | 3.00 | 6.25 | 7 |
3 | 0.535 | 104 | 5.20 | 5.0 | 8 | 2 | 1.641565 | 0.315686 | 4.00 | 6.25 | 6 |
4 | 0.415 | 72 | 3.60 | 3.0 | 6 | 1 | 1.500877 | 0.416910 | 2.75 | 5.00 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1117952 | 0.495 | 99 | 4.95 | 4.0 | 10 | 1 | 2.543826 | 0.513904 | 3.00 | 7.00 | 9 |
1117953 | 0.480 | 96 | 4.80 | 4.0 | 9 | 1 | 2.419221 | 0.504004 | 3.00 | 5.50 | 8 |
1117954 | 0.485 | 98 | 4.90 | 5.0 | 9 | 1 | 1.970840 | 0.402212 | 4.00 | 5.25 | 8 |
1117955 | 0.495 | 99 | 4.95 | 5.0 | 8 | 2 | 1.700619 | 0.343559 | 3.75 | 6.00 | 6 |
1117956 | 0.560 | 110 | 5.50 | 5.0 | 9 | 1 | 2.013115 | 0.366021 | 4.75 | 7.00 | 8 |
1117957 rows × 11 columns
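As a sanity check on how these row-wise statistics are computed, here is a small worked example on a single hypothetical row with values 1, 2, 3, 4 (the column names `a` to `d` are made up). It relies on pandas defaults: `std` uses the sample standard deviation (`ddof=1`) and `quantile` uses linear interpolation.

```python
import pandas as pd

# One hypothetical sample with four feature values
row = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

stats = pd.DataFrame({
    "sum": row.sum(axis=1),
    "mean": row.mean(axis=1),
    "median": row.median(axis=1),
    "std": row.std(axis=1),                 # sample std, ddof=1
    "p25": row.quantile(0.25, axis=1),      # linear interpolation
    "p75": row.quantile(0.75, axis=1),
})
stats["cov"] = stats["std"] / stats["mean"]  # coefficient of variation

print(stats.round(4))
```

By hand: the sum is 10, the mean and median are 2.5, the sample standard deviation is sqrt(5/3), and the quartiles interpolate to 1.75 and 3.25.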
sns.heatmap(df_train_new_only.corr(), annot=True)
<Axes: >
Looking at the pairwise linear correlations between the new features and the target, the per-sample sum and mean of all the features each have a high correlation of 0.92 with the target. The remaining new features have varying levels of correlation with the target, with only standard deviation and range having a lower correlation coefficient than the original features.
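One way to see why a sum can correlate far more strongly with a target than any single feature is a small simulation. It assumes 20 independent features and a target equal to their sum plus Gaussian noise; the noise level is arbitrary, and none of this establishes how the competition data was actually generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_features = 20

# Independent synthetic features; target = sum of features + noise (assumption)
X = pd.DataFrame(rng.poisson(5, size=(50000, n_features)))
y = X.sum(axis=1) + rng.normal(0, 4.0, size=len(X))

single = X.corrwith(y)                          # each feature vs. the target
total = np.corrcoef(X.sum(axis=1), y)[0, 1]     # row-wise sum vs. the target

print(single.round(3).tolist())
print(round(total, 3))
```

Under these assumptions the correlation of the sum is roughly sqrt(20) times the correlation of a single feature, because each feature contributes a small, independent share of the signal and the sum pools all of them.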
fig, ax = plt.subplots(6, 5, figsize=(16,16))
for col, a in zip(df_train_new.drop('FloodProbability', axis=1).columns, ax.reshape(-1)):
    sns.scatterplot(df_train_new[[col,'FloodProbability']], x=col, y='FloodProbability', ax = a, color='blue')
Looking at scatterplots of all our features against FloodProbability, we can see some of the relationships visually. In particular, a positive linear correlation is clearly visible for sum, mean, median, max, p25, and p75.
4. Modelling
For modelling we compare regularized linear regression (ElasticNet), XGBoost, and LightGBM. We tune each model's hyperparameters and then add voting and stacking estimators to the final comparison to find our best-performing estimator.
# Instantiate models and define parameters to grid search
random_state = 47
linreg = ElasticNet(random_state = random_state)
linreg_params = {'alpha':[0.1, 1, 10],
                 'l1_ratio':[0.25, 0.5, 0.75]}
# error_score is a GridSearchCV argument, not an XGBRegressor parameter, so it is not passed here
xgb = XGBRegressor(random_state = random_state,
                   device = 'cuda')
xgb_params = {'learning_rate': [0.03, .07],
              'max_depth': [6, 9],
              'min_child_weight': [1,10],
              'colsample_bytree': [0.5,1]}
# Any further LGBMRegressor arguments are cut off in the source
lgb = LGBMRegressor(random_state = random_state)
lgb_params = {'num_leaves':[10,31],
              'learning_rate':[ 0.01, 0.1],
              'colsample_bytree': [0.5, 1],
              'reg_alpha': [0, 0.05],
              'reg_lambda': [0, 0.05]}
rgs = [
    ('Linear Regression', linreg, linreg_params),
    ('XGBoost Regressior', xgb, xgb_params),
    ('LGBM Regressor', lgb, lgb_params),
]
# NOTE: despite its name, 'mae_score' holds the mean squared error (it wraps mean_squared_error)
scorer = {
    'r2_score': make_scorer(r2_score),
    'mae_score': make_scorer(mean_squared_error),
}
X_train = df_train_new.drop('FloodProbability', axis=1)
y_train = df_train_new['FloodProbability']
X_train_new = df_train_new_only.drop('FloodProbability', axis=1)
(1117957, 30)
(1117957, 10)
We will evaluate three different datasets:
- All features, scaled
- All features, not scaled
- Only new features, scaled
# Three different pipelines
results = []
for rg_name, rg, rg_params in rgs:
    # The arguments after `estimator` are cut off in the source; these are inferred
    # from how `gs` is read below and from the log (5 folds, one line per fit)
    gs = GridSearchCV(estimator=rg, param_grid=rg_params, scoring=scorer,
                      refit='r2_score', cv=5, verbose=2)
    # Original + New Features, Scaled
    pipeline_a = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('estimator', gs),
    ])
    pipeline_a.fit(X_train, y_train)
    result = ['All features scaled', rg_name, gs.best_params_, gs.best_score_, gs.cv_results_['mean_test_r2_score'][gs.best_index_], gs.cv_results_['mean_test_mae_score'][gs.best_index_]]
    results.append(result)
    # Original + New Features, Not scaled
    pipeline_b = Pipeline(steps=[
        ('estimator', gs),
    ])
    pipeline_b.fit(X_train, y_train)
    result = ['All features', rg_name, gs.best_params_, gs.best_score_, gs.cv_results_['mean_test_r2_score'][gs.best_index_], gs.cv_results_['mean_test_mae_score'][gs.best_index_]]
    results.append(result)
    # New Features only, Scaled
    pipeline_c = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('estimator', gs),
    ])
    pipeline_c.fit(X_train_new, y_train)
    result = ['New features scaled', rg_name, gs.best_params_, gs.best_score_, gs.cv_results_['mean_test_r2_score'][gs.best_index_], gs.cv_results_['mean_test_mae_score'][gs.best_index_]]
    results.append(result)
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV] END colsample_bytree=0.5, learning_rate=0.01, num_leaves=10, reg_alpha=0, reg_lambda=0; total time= 1.1s
...
[CV] END colsample_bytree=1, learning_rate=0.1, num_leaves=31, reg_alpha=0.05, reg_lambda=0.05; total time= 1.5s
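The pattern used in the grid search (several metrics tracked, one of them chosen for refitting, results read back from `cv_results_`) can be sketched on a small synthetic regression problem. The data, the parameter grid, and the scorer names `r2` and `mse` below are assumptions chosen for illustration; the built-in scorer strings are used instead of `make_scorer`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)

# Small synthetic regression problem (assumed stand-in for the real data)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(0, 0.1, size=300)

gs = GridSearchCV(
    estimator=ElasticNet(),
    param_grid={"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.25, 0.75]},
    scoring={"r2": "r2", "mse": "neg_mean_squared_error"},  # multi-metric scoring
    refit="r2",   # metric that selects best_params_ / best_score_
    cv=5,
)
gs.fit(X, y)

best = gs.best_index_
best_r2 = gs.cv_results_["mean_test_r2"][best]
best_mse = -gs.cv_results_["mean_test_mse"][best]  # flip sign of the negated MSE

print(gs.best_params_)
print(best_r2, best_mse)
```

With multi-metric scoring, each metric appears in `cv_results_` under `mean_test_<name>`, and `best_score_` is the value of the `refit` metric at `best_index_`.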
result_df = pd.DataFrame(results, columns=['Pipeline','Model','Parameters','Best_Score','R2_Score','MSE'])
result_df['Name'] = result_df['Model'] + '_' + result_df['Pipeline']
palette = sns.color_palette()
cmap = {}
for d, color in zip(set(result_df['Pipeline']), palette):
    cmap[d] = color
result_df['Color'] = [cmap[d] for d in result_df['Pipeline']]
Pipeline | Model | Parameters | Best_Score | R2_Score | MSE | Name | Color | |
0 | All features scaled | Linear Regression | {'alpha': 0.1, 'l1_ratio': 0.25} | 0.589445 | 0.589445 | 0.001069 | Linear Regression_All features scaled | (0.17254901960784313, 0.6274509803921569, 0.17... |
1 | All features | Linear Regression | {'alpha': 0.1, 'l1_ratio': 0.25} | 0.841341 | 0.841341 | 0.000413 | Linear Regression_All features | (0.12156862745098039, 0.4666666666666667, 0.70... |
2 | New features scaled | Linear Regression | {'alpha': 0.1, 'l1_ratio': 0.25} | 0.589445 | 0.589445 | 0.001069 | Linear Regression_New features scaled | (1.0, 0.4980392156862745, 0.054901960784313725) |
3 | All features scaled | XGBoost Regressior | {'colsample_bytree': 1, 'learning_rate': 0.07,... | 0.869006 | 0.869006 | 0.000341 | XGBoost Regressior_All features scaled | (0.17254901960784313, 0.6274509803921569, 0.17... |
4 | All features | XGBoost Regressior | {'colsample_bytree': 1, 'learning_rate': 0.07,... | 0.869006 | 0.869006 | 0.000341 | XGBoost Regressior_All features | (0.12156862745098039, 0.4666666666666667, 0.70... |
fig, ax = plt.subplots()
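# Sort once and take the colors from the sorted frame so each bar keeps its pipeline's color
sorted_df = result_df.sort_values('Best_Score', ascending=False)
sns.barplot(sorted_df, orient='h', x='Best_Score', y='Name', palette=sorted_df['Color'].tolist(), ax=ax)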
ax.set_xlim(min(result_df['Best_Score'])*0.9, max(result_df['Best_Score'])*1.1)
plt.title('GridSearchCV Model Scores')
The scaled all-features dataset led to the best-performing models, with XGBoost and LightGBM scoring highest (for XGBoost, the unscaled all-features dataset gives an identical score). We will use these two models in our stacking/voting estimators.
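XGBoost's identical scores on the scaled and unscaled rows of the results table are consistent with tree splits depending only on the ordering of feature values, which standardization does not change. The sketch below illustrates this with scikit-learn's `DecisionTreeRegressor` as a simple stand-in; it is not a statement about XGBoost's internals, and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)

# Synthetic integer-valued features and a noisy linear target (assumed stand-in)
X = rng.poisson(5, size=(2000, 6)).astype(float)
y = X.sum(axis=1) * 0.005 + rng.normal(0, 0.02, size=2000)
X_new = rng.poisson(5, size=(500, 6)).astype(float)

scaler = StandardScaler().fit(X)

# Same tree settings, fitted once on raw and once on standardized features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(scaler.transform(X), y)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(scaler.transform(X_new))
same = bool(np.allclose(pred_raw, pred_scaled))

print(same)
```

A penalized linear model such as ElasticNet has no such invariance, since its penalty acts on coefficients whose size depends on the feature scale.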
5. Prediction
result_df[(result_df['Model'].isin(['XGBoost Regressior','LGBM Regressor']))&(result_df['Pipeline']=='All features scaled')]['Parameters'].tolist()
[{'colsample_bytree': 1,
'learning_rate': 0.07,
'max_depth': 9,
'min_child_weight': 10},
{'colsample_bytree': 1,
'learning_rate': 0.1,
'num_leaves': 31,
'reg_alpha': 0.05,
'reg_lambda': 0}]
We will feed these hyperparameters from our gridsearch into our new pipeline to include stacking/voting in our comparison.
scaler = StandardScaler()
xgb = XGBRegressor(colsample_bytree= 1,
learning_rate= 0.07,
max_depth= 9,
min_child_weight= 10)
lgb = LGBMRegressor(colsample_bytree= 1,
learning_rate= 0.1,
num_leaves= 31,
reg_alpha= 0.05,
reg_lambda= 0)
vr = VotingRegressor(estimators=[('xgb',xgb),('lgb',lgb)])
sr = StackingRegressor(estimators=[('xgb',xgb),('lgb',lgb)])
rgs = [
    ('XGBoost Regressor', xgb),
    ('LGBM Regressor', lgb),
    ('Voting Regressor', vr),
    ('Stacking Regressor', sr),
]
results = []
for name, rg in rgs:
    pipeline = Pipeline(
        [('scaling', scaler),
         ('estimator', rg)])
    cv = KFold(n_splits=5)
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='r2')
    results.append([name, scores])
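The comparison loop can be tried on a small scale without a GPU. The sketch below swaps in scikit-learn's `GradientBoostingRegressor` and `RandomForestRegressor` as base models (a substitution for illustration, not the models used in this notebook) and runs on synthetic data.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor, VotingRegressor)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Synthetic stand-in data
X = rng.poisson(5, size=(400, 6)).astype(float)
y = X.sum(axis=1) * 0.005 + rng.normal(0, 0.01, size=400)

base = [("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0))]
models = {
    "voting": VotingRegressor(estimators=base),      # averages base predictions
    "stacking": StackingRegressor(estimators=base),  # fits a meta-model on them
}

cv = KFold(n_splits=3, shuffle=True, random_state=0)
mean_scores = {}
for name, model in models.items():
    pipe = Pipeline([("scaling", StandardScaler()), ("estimator", model)])
    mean_scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="r2").mean()

print(mean_scores)
```

`VotingRegressor` averages the base predictions, while `StackingRegressor` trains a final estimator (ridge regression with built-in cross-validation by default) on out-of-fold base predictions, so stacking costs more fits per fold.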
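# y_pred holds the chosen model's predictions on the test features (the fitting and prediction cells are not shown)
pred_df = df_test.copy()[['id']]
pred_df['FloodProbability'] = y_pred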
id | FloodProbability | |
0 | 1117957 | 0.578240 |
1 | 1117958 | 0.456551 |
2 | 1117959 | 0.449741 |
3 | 1117960 | 0.466643 |
4 | 1117961 | 0.466660 |
... | ... | ... |
745300 | 1863257 | 0.475449 |
745301 | 1863258 | 0.444587 |
745302 | 1863259 | 0.619708 |
745303 | 1863260 | 0.549273 |
745304 | 1863261 | 0.528544 |
745305 rows × 2 columns
# Export predictions
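The cells that fit the final model and write the submission file are not shown. A minimal sketch of that step follows, assuming a scaled pipeline and a submission with `id` and `FloodProbability` columns. The synthetic frames, the column names `f0`..`f3`, and the use of `GradientBoostingRegressor` are stand-ins, and any engineered features would need to be applied to the test set in the same way as to the training set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
cols = [f"f{i}" for i in range(4)]

# Synthetic stand-ins for the training and test frames
train = pd.DataFrame(rng.poisson(5, size=(300, 4)), columns=cols)
train["FloodProbability"] = train[cols].sum(axis=1) * 0.005 + 0.4
test = pd.DataFrame(rng.poisson(5, size=(50, 4)), columns=cols)
test.insert(0, "id", np.arange(300, 350))

# Fit on all training rows, then predict on the test features
pipe = Pipeline([("scaling", StandardScaler()),
                 ("estimator", GradientBoostingRegressor(random_state=0))])
pipe.fit(train[cols], train["FloodProbability"])

submission = test[["id"]].copy()
submission["FloodProbability"] = pipe.predict(test[cols])

# Returns the CSV as a string; pass a file path instead to write it to disk
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[:3])
```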
Our evaluation score on the public leaderboard is 0.86902, which at the time of writing places us within the top 10% of this Kaggle playground competition.