{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's grab a data set of car values:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   Price         Mileage  Make   Model    Trim      Type   Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129  8221     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      1
1  17542.036083  9135     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
2  16218.847862  13196    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
3  16336.913140  16342    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      0
4  16339.170324  19832    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      1
\n", "
" ], "text/plain": [ " Price Mileage Make Model Trim Type Cylinder Liter \\\n", "0 17314.103129 8221 Buick Century Sedan 4D Sedan 6 3.1 \n", "1 17542.036083 9135 Buick Century Sedan 4D Sedan 6 3.1 \n", "2 16218.847862 13196 Buick Century Sedan 4D Sedan 6 3.1 \n", "3 16336.913140 16342 Buick Century Sedan 4D Sedan 6 3.1 \n", "4 16339.170324 19832 Buick Century Sedan 4D Sedan 6 3.1 \n", "\n", " Doors Cruise Sound Leather \n", "0 4 1 1 1 \n", "1 4 1 1 0 \n", "2 4 1 1 0 \n", "3 4 1 0 0 \n", "4 4 1 0 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_excel('https://admintuts.tech/wp-content/downloads/xls/cars.xls')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mileage Price\n", "Mileage \n", "(0, 10000] 5588.629630 24096.714451\n", "(10000, 20000] 15898.496183 21955.979607\n", "(20000, 30000] 24114.407104 20278.606252\n", "(30000, 40000] 33610.338710 19463.670267\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import numpy as np\n", "\n", "df1 = df[['Mileage','Price']]\n", "bins = np.arange(0,50000,10000)\n", "groups = df1.groupby(pd.cut(df1['Mileage'],bins)).mean()\n", "\n", "print(groups.head())\n", "groups['Price'].plot.line()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use pandas to split this data up into the feature vectors we're interested in and the value we're trying to predict.\n", "\n", "Note how we are avoiding the make and model; regressions don't work well with categorical (nominal) values unless you can convert them into a numerical encoding that makes sense.\n", "\n", "Let's scale our feature data into the same range so we can easily compare the coefficients we end up with." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mileage Cylinder Doors\n", "0 -1.417485 0.527410 0.556279\n", "1 -1.305902 0.527410 0.556279\n", "2 -0.810128 0.527410 0.556279\n", "3 -0.426058 0.527410 0.556279\n", "4 0.000008 0.527410 0.556279\n", "5 0.293493 0.527410 0.556279\n", "6 0.335001 0.527410 0.556279\n", "7 0.382369 0.527410 0.556279\n", "8 0.511409 0.527410 0.556279\n", "9 0.914768 0.527410 0.556279\n", "10 -1.171368 0.527410 0.556279\n", "11 -0.581834 0.527410 0.556279\n", "12 -0.390532 0.527410 0.556279\n", "13 -0.003899 0.527410 0.556279\n", "14 0.430591 0.527410 0.556279\n", "15 0.480156 0.527410 0.556279\n", "16 0.509822 0.527410 0.556279\n", "17 0.757160 0.527410 0.556279\n", "18 1.594886 0.527410 0.556279\n", "19 1.810849 0.527410 0.556279\n", "20 -1.326046 0.527410 0.556279\n", "21 -1.129860 0.527410 0.556279\n", "22 -0.667658 0.527410 0.556279\n", "23 -0.405792 0.527410 0.556279\n", "24 -0.112796 0.527410 0.556279\n", "25 -0.044552 0.527410 0.556279\n", "26 0.190700 0.527410 0.556279\n", "27 0.337442 0.527410 0.556279\n", "28 
0.566102 0.527410 0.556279\n", "29 0.660837 0.527410 0.556279\n", ".. ... ... ...\n", "774 -0.161262 -0.914896 0.556279\n", "775 -0.089234 -0.914896 0.556279\n", "776 -0.040523 -0.914896 0.556279\n", "777 0.002572 -0.914896 0.556279\n", "778 0.236603 -0.914896 0.556279\n", "779 0.249666 -0.914896 0.556279\n", "780 0.357220 -0.914896 0.556279\n", "781 0.365521 -0.914896 0.556279\n", "782 0.434131 -0.914896 0.556279\n", "783 0.517269 -0.914896 0.556279\n", "784 0.589908 -0.914896 0.556279\n", "785 0.599186 -0.914896 0.556279\n", "786 0.793052 -0.914896 0.556279\n", "787 1.033554 -0.914896 0.556279\n", "788 1.045762 -0.914896 0.556279\n", "789 1.205567 -0.914896 0.556279\n", "790 1.541414 -0.914896 0.556279\n", "791 1.561070 -0.914896 0.556279\n", "792 1.725026 -0.914896 0.556279\n", "793 1.851502 -0.914896 0.556279\n", "794 -1.709871 0.527410 0.556279\n", "795 -1.474375 0.527410 0.556279\n", "796 -1.187849 0.527410 0.556279\n", "797 -1.079929 0.527410 0.556279\n", "798 -0.682430 0.527410 0.556279\n", "799 -0.439853 0.527410 0.556279\n", "800 -0.089966 0.527410 0.556279\n", "801 0.079605 0.527410 0.556279\n", "802 0.750446 0.527410 0.556279\n", "803 1.932565 0.527410 0.556279\n", "\n", "[804 rows x 3 columns]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/nikolas/Desktop/venv/lib/python3.6/site-packages/ipykernel_launcher.py:8: FutureWarning: Method .as_matrix will be removed in a future version. 
Use .values instead.\n", " \n", "/home/nikolas/Desktop/venv/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n", " warnings.warn(msg, DataConversionWarning)\n", "/home/nikolas/Desktop/venv/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n", " warnings.warn(msg, DataConversionWarning)\n", "/home/nikolas/Desktop/venv/lib/python3.6/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " \n", "/home/nikolas/Desktop/venv/lib/python3.6/site-packages/pandas/core/indexing.py:543: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " self.obj[item] = s\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: Price R-squared (uncentered): 0.064
Model: OLS Adj. R-squared (uncentered): 0.060
Method: Least Squares F-statistic: 18.11
Date: Sun, 01 Sep 2019 Prob (F-statistic): 2.23e-11
Time: 03:30:21 Log-Likelihood: -9207.1
No. Observations: 804 AIC: 1.842e+04
Df Residuals: 801 BIC: 1.843e+04
Df Model: 3
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Mileage -1272.3412 804.623 -1.581 0.114 -2851.759 307.077
Cylinder 5587.4472 804.509 6.945 0.000 4008.252 7166.642
Doors -1404.5513 804.275 -1.746 0.081 -2983.288 174.185
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 157.913 Durbin-Watson: 0.008
Prob(Omnibus): 0.000 Jarque-Bera (JB): 257.529
Skew: 1.278 Prob(JB): 1.20e-56
Kurtosis: 4.074 Cond. No. 1.03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "=======================================================================================\n", "Dep. Variable: Price R-squared (uncentered): 0.064\n", "Model: OLS Adj. R-squared (uncentered): 0.060\n", "Method: Least Squares F-statistic: 18.11\n", "Date: Sun, 01 Sep 2019 Prob (F-statistic): 2.23e-11\n", "Time: 03:30:21 Log-Likelihood: -9207.1\n", "No. Observations: 804 AIC: 1.842e+04\n", "Df Residuals: 801 BIC: 1.843e+04\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Mileage -1272.3412 804.623 -1.581 0.114 -2851.759 307.077\n", "Cylinder 5587.4472 804.509 6.945 0.000 4008.252 7166.642\n", "Doors -1404.5513 804.275 -1.746 0.081 -2983.288 174.185\n", "==============================================================================\n", "Omnibus: 157.913 Durbin-Watson: 0.008\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 257.529\n", "Skew: 1.278 Prob(JB): 1.20e-56\n", "Kurtosis: 4.074 Cond. No. 
1.03\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import statsmodels.api as sm\n", "from sklearn.preprocessing import StandardScaler\n", "scale = StandardScaler()\n", "\n", "# Copy the feature columns so we can overwrite them without a SettingWithCopyWarning\n", "X = df[['Mileage', 'Cylinder', 'Doors']].copy()\n", "y = df['Price']\n", "\n", "# Standardize the features (.values replaces the deprecated .as_matrix())\n", "X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)\n", "\n", "print(X)\n", "\n", "est = sm.OLS(y, X).fit()\n", "\n", "est.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table of coefficients above gives us the values to plug into an equation of the form:\n", " B0 + B1 * Mileage + B2 * Cylinder + B3 * Doors\n", " \n", "Note that `sm.OLS` does not add an intercept on its own, and we never called `sm.add_constant`, so B0 is effectively 0 here; that is why the summary reports an uncentered R-squared.\n", "\n", "Because the features are standardized, the coefficients are directly comparable. The number of cylinders has by far the largest coefficient and is the only feature with a p-value below 0.05, so it matters more than mileage or doors.\n", "\n", "Could we have figured that out earlier?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Doors\n", "2 23807.135520\n", "4 20580.670749\n", "Name: Price, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.groupby(df.Doors).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Surprisingly, more doors does not mean a higher price: two-door cars average about 23,807 versus about 20,581 for four-door cars. (Maybe two doors implies a sports car in some cases?) That fits with the weak, statistically insignificant Doors coefficient above. This is a very small data set, however, so we can't read much meaning into it." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.02051781 1.96971667 0.55627894]]\n", "[10198.25991671]\n" ] } ], "source": [ "scaled = scale.transform([[20000, 8, 4]])\n", "print(scaled)\n", "predicted = est.predict(scaled[0])\n", "print(predicted)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 1 }