California Housing Prices#

[1]:

import kagglehub
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

[2]:

path = kagglehub.dataset_download("camnugent/california-housing-prices")
file = os.listdir(path)[0]

Using Colab cache for faster access to the 'california-housing-prices' dataset.
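
Taking `os.listdir(path)[0]` assumes the CSV is the first entry in the download folder. A minimal sketch of a more defensive lookup follows; `find_csv` is a hypothetical helper, and a temporary folder stands in for the real download path:

```python
import os
import tempfile

def find_csv(folder):
    """Return the path of the first CSV file (alphabetically) in `folder`."""
    csv_files = sorted(f for f in os.listdir(folder) if f.lower().endswith(".csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {folder}")
    return os.path.join(folder, csv_files[0])

# Temporary folder with one CSV and one unrelated file (illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("README.txt", "housing.csv"):
        open(os.path.join(tmp, name), "w").close()
    found = os.path.basename(find_csv(tmp))

print(found)
```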

About Dataset#

https://www.kaggle.com/datasets/camnugent/california-housing-prices/data

Features#

Feature Name	Information
longitude	A measure of how far west a house is; a more negative value is farther west
latitude	A measure of how far north a house is; a higher value is farther north
housing_median_age	Median age of a house within a block; a lower number is a newer building
total_rooms	Total number of rooms within a block
total_bedrooms	Total number of bedrooms within a block
population	Total number of people residing within a block
households	Total number of households (a group of people residing within a home unit) within a block
median_income	Median income for households within a block of houses (measured in tens of thousands of US Dollars)
median_house_value	Median house value for households within a block (measured in US Dollars)
ocean_proximity	Location of the house relative to the ocean/sea

Objectives#

Build a model that predicts the median house value of a block from the other features.

Important: the data must be cleaned before modeling.

Acknowledgements#

This data was initially featured in the following paper: Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.

This dataset is a modified version of the California Housing dataset available from: Luís Torgo’s page (University of Porto)

[3]:

# Reading the file (os.path.join builds the path portably)
df = pd.read_csv(os.path.join(path, file))
df.head()

[3]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

[4]:

df.describe()

[4]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

[5]:

# Checking the features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

[6]:

# Checking the missing values in the dataset
df.isnull().sum()

[6]:

	0
longitude	0
latitude	0
housing_median_age	0
total_rooms	0
total_bedrooms	207
population	0
households	0
median_income	0
median_house_value	0
ocean_proximity	0

dtype: int64

[7]:

# Filling the missing value
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())
df.isnull().sum()

[7]:

	0
longitude	0
latitude	0
housing_median_age	0
total_rooms	0
total_bedrooms	0
population	0
households	0
median_income	0
median_house_value	0
ocean_proximity	0

dtype: int64

[8]:

# Checking the distributions of the numerical features
df_OnlyFloat = df.select_dtypes('float64')
df_OnlyFloatCols = df_OnlyFloat.columns

fig, ax = plt.subplots(figsize=(10,8), ncols=3, nrows=3)
# Flatten the 3x3 grid so that each float column maps to one subplot
for axis, getFeatureCol in zip(ax.flat, df_OnlyFloatCols):
  sns.histplot(df_OnlyFloat[getFeatureCol], kde=True, ax=axis)
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

_images/California_Housing_Prices_9_0.png

[9]:

fig, ax = plt.subplots(figsize=(12,6))
sns.scatterplot(data=df, x='median_income', y='median_house_value', hue='ocean_proximity', ax=ax)
ax.set_ylabel('Median House Value (in $)')
ax.set_xlabel('Median Income (in $10,000)')

[9]:

Text(0.5, 0, 'Median Income (in $10,000)')

_images/California_Housing_Prices_10_1.png

[10]:

fig, ax = plt.subplots(figsize=(12,6))
sns.scatterplot(data=df, x='longitude', y='latitude', hue='ocean_proximity', edgecolor='black', linewidth=0.1, ax=ax)

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Block of Houses in California')

[10]:

Text(0.5, 1.0, 'Block of Houses in California')

_images/California_Housing_Prices_11_1.png

[11]:

df_OceanProximityAgg = df.groupby(by='ocean_proximity').agg(
    {'median_income' : 'sum', 'median_house_value' : 'sum'}
    )
df_OceanProximityAgg = df_OceanProximityAgg.sort_values(by='median_house_value', ascending=False)

[12]:

fig, ax = plt.subplots(figsize=(12,6), ncols=2)
sns.barplot(data=df_OceanProximityAgg, x='ocean_proximity', y='median_house_value', ax=ax[0])
ax[0].set_ylabel('Total Median House Value (in $)')
ax[0].set_xlabel('Ocean Proximity')

sns.barplot(data=df_OceanProximityAgg, x='ocean_proximity', y='median_income', ax=ax[1])
ax[1].set_ylabel('Total Median Income (in $10,000)')
ax[1].set_xlabel('Ocean Proximity')

[12]:

Text(0.5, 0, 'Ocean Proximity')

_images/California_Housing_Prices_13_1.png
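
The bars above sum block-level medians, so a category with many blocks gets a tall bar regardless of how expensive its blocks are. A small sketch on toy data (illustrative values, not the real dataset) shows how aggregating with `mean` removes that dependence on group size:

```python
import pandas as pd

# Three cheap inland blocks versus one expensive block near the bay
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND", "NEAR BAY"],
    "median_house_value": [100000.0, 120000.0, 110000.0, 300000.0],
})

agg = toy.groupby("ocean_proximity")["median_house_value"].agg(["sum", "mean", "count"])
print(agg)
```

In this toy case the sum ranks INLAND first simply because it has more rows, while the mean ranks NEAR BAY first.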

The ocean_proximity feature is categorical, so it must be encoded as numerical data before modeling.

[13]:

mapper = {'<1H OCEAN' : 1, 'INLAND' : 2, 'NEAR OCEAN' : 3, 'NEAR BAY' : 4, 'ISLAND' : 5}

df['ocean_proximity_encoded'] = df['ocean_proximity'].map(mapper)

df.head()

[13]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity	ocean_proximity_encoded
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY	4
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY	4
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY	4
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY	4
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY	4
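
The integer codes above impose an order and equal spacing on categories that have neither, and a linear model will read them as a numeric trend. One common alternative is one-hot encoding; a minimal sketch with `pd.get_dummies` on a toy column (the `ocean` prefix is an arbitrary choice):

```python
import pandas as pd

toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"]})

# One 0/1 indicator column per category
one_hot = pd.get_dummies(toy["ocean_proximity"], prefix="ocean", dtype=int)
encoded = pd.concat([toy, one_hot], axis=1)
print(encoded)
```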

[14]:

# Plotting a heatmap of the pairwise correlations between features

fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), cmap='viridis', linewidths=.2, fmt='.3f', annot=True, ax=ax)

[14]:

<Axes: >

_images/California_Housing_Prices_16_1.png
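
When only the relationship with the target matters, the target's column of the correlation matrix can be read on its own. A minimal sketch on synthetic data (randomly generated, not the real dataset) ranks features by the absolute value of their correlation with `median_house_value`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(4.0, 1.5, size=200)

# Synthetic frame: house value depends on income, population is pure noise
toy = pd.DataFrame({
    "median_income": income,
    "population": rng.normal(1400, 500, size=200),
    "median_house_value": 40000 * income + rng.normal(0, 20000, size=200),
})

target_corr = toy.corr(numeric_only=True)["median_house_value"].drop("median_house_value")
print(target_corr.sort_values(key=abs, ascending=False))
```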

[15]:

# Checking the outliers for numerical features
df_OnlyFloat = df.select_dtypes('float64')
df_OnlyFloatCols = df_OnlyFloat.columns

fig, ax = plt.subplots(figsize=(10,8), ncols=3, nrows=3)
# Flatten the 3x3 grid so that each float column maps to one subplot
for axis, getFeatureCol in zip(ax.flat, df_OnlyFloatCols):
  sns.boxplot(df_OnlyFloat[getFeatureCol], ax=axis)
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

_images/California_Housing_Prices_17_0.png

[16]:

df_clean = df.copy(deep=True)

[17]:

# Removing outliers using the IQR method.
# The filter is repeated n_iteration times because each pass changes the quartiles.
features = ['total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']
n_iteration = 6

for i in range(n_iteration):
  for feature in features:
    Q1 = df_clean[feature].quantile(0.25)
    Q3 = df_clean[feature].quantile(0.75)
    IQR = Q3 - Q1

    # Filtering
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df_clean = df_clean[(df_clean[feature] >= lower_bound) & (df_clean[feature] <= upper_bound)]
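
Each pass above recomputes the quartiles on the already-filtered data, so repeated passes keep tightening the bounds. The single-feature step can be sketched as a small helper; `iqr_filter` is a hypothetical name, and the toy values are illustrative:

```python
import pandas as pd

def iqr_filter(frame, feature, k=1.5):
    """Keep rows whose `feature` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[feature].quantile(0.25)
    q3 = frame[feature].quantile(0.75)
    iqr = q3 - q1
    return frame[frame[feature].between(q1 - k * iqr, q3 + k * iqr)]

# One obvious outlier (500) among values clustered around 10 to 13
toy = pd.DataFrame({"population": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0]})
filtered = iqr_filter(toy, "population")
print(len(toy), "->", len(filtered))
```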

[19]:

# Checking the outliers for numerical features after the IQR filter
df_OnlyFloat = df_clean.select_dtypes('float64')
df_OnlyFloatCols = df_OnlyFloat.columns

fig, ax = plt.subplots(figsize=(10,8), ncols=3, nrows=3)
# Flatten the 3x3 grid so that each float column maps to one subplot
for axis, getFeatureCol in zip(ax.flat, df_OnlyFloatCols):
  sns.boxplot(df_OnlyFloat[getFeatureCol], ax=axis)
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

_images/California_Housing_Prices_20_0.png

[20]:

# Checking the distributions of the numerical features after the IQR filter
df_OnlyFloat = df_clean.select_dtypes('float64')
df_OnlyFloatCols = df_OnlyFloat.columns

fig, ax = plt.subplots(figsize=(10,8), ncols=3, nrows=3)
# Flatten the 3x3 grid so that each float column maps to one subplot
for axis, getFeatureCol in zip(ax.flat, df_OnlyFloatCols):
  sns.histplot(df_OnlyFloat[getFeatureCol], kde=True, ax=axis)
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

_images/California_Housing_Prices_21_0.png

[21]:

fig, ax = plt.subplots(figsize=(12,6))
sns.scatterplot(data=df_clean, x='median_income', y='median_house_value', hue='ocean_proximity', ax=ax)
ax.set_ylabel('Median House Value (in $)')
ax.set_xlabel('Median Income (in $10,000)')

[21]:

Text(0.5, 0, 'Median Income (in $10,000)')

_images/California_Housing_Prices_22_1.png

[22]:

# Plotting a heatmap of the pairwise correlations between features (after cleaning)

fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(df_clean.corr(numeric_only=True), cmap='viridis', linewidths=.2, fmt='.3f', annot=True, ax=ax)

[22]:

<Axes: >

_images/California_Housing_Prices_23_1.png

[23]:

# median_income has a clear positive correlation with median_house_value:
# as the median income increases, the median house value tends to increase.
# However, the relationship is weaker than income alone would suggest, because
# population, housing median age, households, total rooms, and total bedrooms
# also influence the median house value.
# median_house_value is therefore the target of the linear model,
# and the remaining features are used as inputs.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

[24]:

scaler = StandardScaler().set_output(transform="pandas")

[25]:

y = df_clean['median_house_value']
x = df_clean.drop(['median_house_value', 'ocean_proximity', 'longitude', 'latitude'], axis=1)

# split the dataset into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Scaling (fit on the training set only, then apply to the test set)
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Training the dataset using LinearRegression
linear_model = LinearRegression()
linear_model.fit(x_train_scaled, y_train)


y_pred = linear_model.predict(x_test_scaled)

# Evaluating the model
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
# mean_absolute_percentage_error returns a fraction (0.31 means 31%)
mape = mean_absolute_percentage_error(y_test, y_pred)

print(f"R2: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAPE: {mape:.4f}")

R2: 0.5113
RMSE: 59195.7423
MAPE: 0.3103
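
A single train/test split gives only one estimate of the error. As a hedged sketch, the same scaler and linear model could be wrapped in a scikit-learn pipeline and scored with 5-fold cross-validation. Synthetic data stands in for `df_clean` here, so the scores say nothing about the housing model itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 50000 * X[:, 0] + 10000 * X[:, 1] + rng.normal(0, 5000, size=300)

# The pipeline refits the scaler inside each fold, so no fold sees test statistics
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```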
