House Price

The challenge is on ! The goal of this notebook is to show what features influences the most the sale prices of Houses in Ames. The dataset comes from Kaggle and has been originally launch to study regression techniques. But before any Machine Learning technique, it is obligatory to study the shape, distribution ans essential features of a dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

data = pd.read_csv('train.csv')
data.head()

Data exploration and cleaning¶

data.shape

(1460, 81)

data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

data.isnull().sum().sort_values(ascending=False)[:20]

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageCond        81
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
BsmtExposure      38
BsmtFinType2      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrArea         8
MasVnrType         8
Electrical         1
Utilities          0
dtype: int64

#After reading the description.txt, I choose to delete columns that has no interest in my research or are in duplicates
data = data.drop(columns=['Id','Street'],axis=1)
#I also delete the column PoolQC which has a lot of missing values
data = data.drop(columns=['PoolQC','MiscFeature'],axis=1)

# Adding a new feature TotalBathrooms with sum of FullBath and HalfBath + in basement
data['TotalBathrooms'] = (data['FullBath'] + (0.5 * data['HalfBath']) + data['BsmtFullBath'] + (0.5 * data['BsmtHalfBath']))

# When NA means 'none' in the description.txt I fill the missing values of the features by 'None'
for col in ('FireplaceQu',
            'Fence',
            'Alley',
            'GarageType', 
            'GarageFinish', 
            'GarageQual', 
            'GarageCond'):
    data[col]=data[col].fillna('None')

# After checking if the house has a garage, I fill the missing values of GarageYrBlt by the median value. 
#If there is no garage I fill GarageYrBlt by 0
data.loc[data['GarageType']!='None', "GarageYrBlt"] = data["GarageYrBlt"].fillna(data['GarageYrBlt'].median())
data["GarageYrBlt"]=data["GarageYrBlt"].fillna(0)

# I fill the missing values of the column LotFrontage by the median value
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))

# Making false numerical variables into categorical
data['YrSold'] = data['YrSold'].astype(str)
data['OverallCond'] = data['OverallCond'].astype(str)
data['MoSold'] = data['MoSold'].astype(str)
data['MSSubClass'] = data['MSSubClass'].apply(str)

# Encoding the dataframe into another one
encoded_data = pd.get_dummies(data)
encoded_data.head()

Data research¶

# study of the Base variable = SalePrice
data['SalePrice'].describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

# Displaying distribution of the target variable
data['SalePrice'].hist(bins=50)

<matplotlib.axes._subplots.AxesSubplot at 0x1e3ae114160>

# Adjusting skewed data distribution
encoded_data['SalePrice_skewed'] = np.log1p(data['SalePrice']) 
encoded_data['SalePrice_skewed'].hist(bins=50)

<matplotlib.axes._subplots.AxesSubplot at 0x1e3ae812908>

# Displyaing top 10 of data correlated to SalePrice
encoded_data[encoded_data.columns[1:]].corr()['SalePrice_skewed'][:].sort_values(ascending=False)[2:12]

OverallQual       0.817185
GrLivArea         0.700927
GarageCars        0.680625
TotalBathrooms    0.673011
GarageArea        0.650888
TotalBsmtSF       0.612134
1stFlrSF          0.596981
FullBath          0.594771
YearBuilt         0.586570
YearRemodAdd      0.565608
Name: SalePrice_skewed, dtype: float64

Statistical Analysis¶

# I choose to study the Ground Live Area vs Sale Price. There is a linear regression 
plt.figure(figsize=(20,8))
sns.regplot(x='GrLivArea', y="SalePrice_skewed", data=encoded_data, color='green')

<matplotlib.axes._subplots.AxesSubplot at 0x1e3ae046f98>

#Checking the reasons of outlier : GrLivArea > 4000
encoded_data[(encoded_data['GrLivArea']> 4000) & (encoded_data['SalePrice_skewed']<13)]

#The reason is that the SaleCondition = Partial (means Home was not completed when last assessed)
# I delete those rows.
encoded_data.drop([523,1298], inplace=True)

# I choose to study the impact of OverallQual on Sale Price.
plt.figure(figsize=(10,8))
sns.boxplot(x="OverallQual", y="SalePrice_skewed", data=encoded_data)

<matplotlib.axes._subplots.AxesSubplot at 0x1e3ae9d4cf8>

# I choose to study the Year of construction vs Sale Price. There is a linear regression 
plt.figure(figsize=(20,8))
sns.regplot(x='YearBuilt', y="SalePrice_skewed", data=encoded_data)

<matplotlib.axes._subplots.AxesSubplot at 0x1e3ae969d30>

# I choose to study the existence and surface of the basement vs Sale Price. There is a linear regression 
plt.figure(figsize=(20,10))
sns.regplot(x='TotalBsmtSF', y="SalePrice_skewed", data=encoded_data, color='purple')

<matplotlib.axes._subplots.AxesSubplot at 0x1e3aec76400>

# Impact of Full Bathrooms on Sale Price.
plt.figure(figsize=(10,8))
sns.boxplot(x="TotalBathrooms", y="SalePrice_skewed", data=encoded_data)

<matplotlib.axes._subplots.AxesSubplot at 0x1e3aeb0e358>

#Checking why there are outliers when +5 bathrooms
from math import exp
pd.set_option('display.max_columns', 500)
outliers = data[(data['TotalBathrooms']>= 5) & (data['SalePrice']<exp(13))]
outliers

# The BsmtFinType2 indicates Unfinished, this could explain the low sale price. 
# Moreover, the values in YearBuilt are below 1990 : the houses are old
# Finally, the OverallQual values are medium (5/10)

# I choose to study the existence and surface of a first floor vs Sale Price. There is a linear regression 
plt.figure(figsize=(20,10))
sns.regplot(x='1stFlrSF', y="SalePrice_skewed", data=encoded_data, color='orange')

<matplotlib.axes._subplots.AxesSubplot at 0x1e3aeb33ef0>

# Checking R-squared with the 4 most correlated features
X = sm.add_constant(encoded_data[['GrLivArea','TotalBathrooms','OverallQual','GarageCars']])
Y = encoded_data['SalePrice_skewed']

model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 

print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:       SalePrice_skewed   R-squared:                       0.818
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     1628.
Date:                Tue, 21 Jan 2020   Prob (F-statistic):               0.00
Time:                        11:09:31   Log-Likelihood:                 508.98
No. Observations:                1458   AIC:                            -1008.
Df Residuals:                    1453   BIC:                            -981.5
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             10.5226      0.021    509.651      0.000      10.482      10.563
GrLivArea          0.0002    1.2e-05     16.580      0.000       0.000       0.000
TotalBathrooms     0.0970      0.008     12.858      0.000       0.082       0.112
OverallQual        0.1313      0.005     28.580      0.000       0.122       0.140
GarageCars         0.1058      0.008     13.627      0.000       0.091       0.121
==============================================================================
Omnibus:                      175.701   Durbin-Watson:                   2.015
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              412.932
Skew:                          -0.690   Prob(JB):                     2.15e-90
Kurtosis:                       5.212   Cond. No.                     7.44e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

C:\Users\aline.debenath\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

# The model explains 82% of the variance of Sale Price !

# Even if the correlation coefficient didn't show it, I presumed that the neighborhood had an impact on the Sale Price.
plt.figure(figsize=(30,10))
sns.boxplot(x=data['Neighborhood'], y=data['SalePrice'], data=data)

<matplotlib.axes._subplots.AxesSubplot at 0x1e3b0629e48>

# Using ANOVA to test H0:the mean SalePrice of every neighborhoods are equals  
model = ols('SalePrice ~ Neighborhood', data = data).fit()
                
anova_result = sm.stats.anova_lm(model, typ=2)
print (anova_result)

                    sum_sq      df          F         PR(>F)
Neighborhood  5.023606e+12    24.0  71.784865  1.558600e-225
Residual      4.184305e+12  1435.0        NaN            NaN

# H0 is rejected : the means are not equal, there is a difference of SalePrice between neighborhoods
# But based on our correlation result, it is not the main features to influence the SalePrice

	LotFrontage	LotArea	OverallQual	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtUnfSF	TotalBsmtSF	...	SaleType_WD	SaleCondition_Abnorml	SaleCondition_Normal
0	65.0	8450	7	2003	2003	196.0	706	150	856	...	1	0	1
1	80.0	9600	6	1976	1976	0.0	978	284	1262	...	1	0	1
2	68.0	11250	7	2001	2002	162.0	486	434	920	...	1	0	1
3	60.0	9550	7	1915	1970	0.0	216	540	756	...	1	1	0
4	84.0	14260	8	2000	2000	350.0	655	490	1145	...	1	0	1

	LotFrontage	LotArea	OverallQual	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	...	SaleType_New	SaleType_Oth	SaleType_WD	SaleCondition_Abnorml	SaleCondition_AdjLand	SaleCondition_Alloca	SaleCondition_Family	SaleCondition_Normal	SaleCondition_Partial	SalePrice_skewed
523	130.0	40094	10	2007	2008	762.0	2260	0	878	3138	...	1	0	0	0	0	0	0	0	1	12.126764
1298	313.0	63887	10	2008	2008	796.0	5644	0	466	6110	...	1	0	0	0	0	0	0	0	1	11.982935

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

	MSSubClass	MSZoning	LotFrontage	LotArea	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	Fence	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice	TotalBathrooms
738	90	RL	60.0	10800	None	Reg	Lvl	AllPub	Inside	Gtl	Edwards	Norm	Norm	Duplex	1Story	5	5	1987	1988	Gable	CompShg	Plywood	Plywood	None	0.0	TA	TA	CBlock	Gd	Gd	Gd	GLQ	1200	Unf	0	0	1200	GasA	TA	Y	SBrkr	1200	0	0	1200	3	0	3	0	3	1	TA	5	Typ	0	None	None	0.0	None	0	0	None	None	Y	120	0	0	0	0	0	None	0	3	2009	WD	Alloca	179000	6.0
921	90	RL	67.0	8777	None	Reg	Lvl	AllPub	Inside	Gtl	Edwards	Feedr	Norm	Duplex	1.5Fin	5	7	1900	2003	Gable	CompShg	MetalSd	MetalSd	None	0.0	TA	TA	CBlock	TA	TA	No	ALQ	1084	Unf	0	188	1272	GasA	Gd	Y	SBrkr	1272	928	0	2200	2	0	2	2	4	2	TA	9	Typ	0	None	None	0.0	None	0	0	None	None	N	0	70	0	0	0	0	GdPrv	0	9	2008	WD	Normal	145900	5.0