14 - Introduction to Data Science#

Data Science is a field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It is a multidisciplinary field that combines statistics, data analysis, machine learning, data visualization, and domain expertise to understand complex problems and make data-driven decisions.

Tools used in data science include programming languages like Python, R, and SQL, as well as libraries and frameworks like Pandas, NumPy, Scikit-learn, TensorFlow, and PyTorch. Data scientists also use tools for data visualization like Matplotlib, Seaborn, and Tableau.

This lecture is just an introduction and will only cover how the Python tools that we've already learned can be used in data science.

We’ll cover the following topics:

  • Reading Files (CSV or Excel) using Pandas

  • Data Exploration and Cleaning

    • Data Types

    • Missing Values

    • Duplicates

    • Outliers

    • Encoding Categorical Variables

  • Data Visualization

  • Regression (Machine Learning)

A great resource for data science training is the Kaggle platform. It has a lot of datasets and competitions that you can use to practice your data science skills.

Here we’ll use a dataset from Kaggle called “Used Cars Dataset”. You can download it here.

#First import the packages we'll use: pandas, matplotlib, and numpy
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Reading Files with the Pandas Package#

Pandas is a Python package that is very helpful for processing data, including reading and writing CSV files (and other text-based files). Similar to NumPy, it has many powerful functions and methods. You can read more about it here: https://pandas.pydata.org/.

A CSV file is a text file containing comma-separated values; it can be opened in any text editor. For example, the following data in a text file would be a CSV file:

Time (s), Concentration (M), Temperature (K)
0, 0.0, 300
1, 0.1, 310
2, 0.2, 320
3, 0.3, 330

You could also have a CSV file that has multiple header rows:

This is a description of the data in the file that could be multirowed
Time (s), Concentration (M), Temperature (K)
0, 0.0, 300
1, 0.1, 310
2, 0.2, 320
3, 0.3, 330

or you could have a text file in which a tab character is used instead of a comma to separate the values.
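As a minimal sketch (the file names below are hypothetical placeholders), each of these variants can be read with pd.read_csv:

import pandas as pd

# simple case: the first row is the header
df1 = pd.read_csv('data.csv')
# extra description rows before the header: skip them
df2 = pd.read_csv('data_with_description.csv', skiprows=1)
# tab-separated values instead of commas: specify the separator
df3 = pd.read_csv('data_tabs.txt', sep='\t')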

#Let's read in the data on the used cars (downloaded from the Kaggle page below)
url = 'https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data'
df = pd.read_csv('../../../data/vehicles.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  paint_color   296677 non-null  object 
 19  image_url     426812 non-null  object 
 20  description   426810 non-null  object 
 21  county        0 non-null       float64
 22  state         426880 non-null  object 
 23  lat           420331 non-null  float64
 24  long          420331 non-null  float64
 25  posting_date  426812 non-null  object 
dtypes: float64(5), int64(2), object(19)
memory usage: 84.7+ MB
#Referencing a column is done by using the column header name.
# For example, to get the odometer column, you would use the following:
print(df['odometer']) #print the column named odometer
#To get the value at row index 30000 of that column, use any of the following:
i = 30000
print(df['odometer'][i],'or', df['odometer'].iloc[i], 'or', df.iloc[i]['odometer'])
0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
           ...   
426875    32226.0
426876    12029.0
426877     4174.0
426878    30112.0
426879    22716.0
Name: odometer, Length: 426880, dtype: float64
130000.0 or 130000.0 or 130000.0

That’s weird, the county column doesn’t have any non-null values. Let’s drop it.

df = df.drop(columns=['county'])
print(df.shape)
print(df.dtypes)
(426880, 25)
id                int64
url              object
region           object
region_url       object
price             int64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
description      object
state            object
lat             float64
long            float64
posting_date     object
dtype: object

For this example, we’d like to predict the price of the vehicle using the data that we have. Some columns are not useful for this prediction like id, url, image_url, and description. We’ll drop these columns.

Models#

Anytime you create a function, you've just created a model. A model relates one thing to another. The variable(s) you can control are the independent variable(s), and the output from the function (or model) is the dependent variable(s). There are models all around us (a trivial sketch of the first example follows this list), including:

  • in your phone, there is a model that relates the voltage of the battery to the percentage of battery life left

  • in your car, there is a model that relates the vehicle speed to the amount of fuel used

  • in your refrigerator, there is a model that relates the temperature setting to the temperature inside the refrigerator

  • in your phone, there is a model that relates the destination to the route and the time predicted to get there
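As a trivial sketch of the first example (the numbers are made up for illustration, not taken from any real phone):

def battery_percent(voltage):
    # toy model: linearly map a cell voltage (assumed 3.0-4.2 V range) to percent charge
    return 100 * (voltage - 3.0) / (4.2 - 3.0)

print(battery_percent(3.7))  # about 58%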

Empirical or Theoretical Models#

Models can be empirical, which is what is typically meant when you perform a regression: you fit collected data to an empirical relationship. For example, you might have a set of data that you fit to a linear model; the linear model is an empirical model.

Theoretical models are based on theory. For example, the ideal gas law is a theoretical model. Likewise, for the ODEs we've talked about, where we use the balance equations (accum = in - out + gen - cons), the relationship between the independent and dependent variables is based on theory.
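Here is a minimal sketch contrasting the two (all numbers are made up for illustration): an empirical model is fitted to measured data, while a theoretical model is written down from the underlying physics.

import numpy as np

# Empirical: fit a straight line to the small time/temperature example above
time = np.array([0, 1, 2, 3])                 # s
temperature = np.array([300, 310, 320, 330])  # K
slope, intercept = np.polyfit(time, temperature, 1)  # fitted slope and intercept

# Theoretical: the ideal gas law P = nRT/V comes from theory, not from fitting
n, R, T, V = 1.0, 8.314, 300.0, 0.024  # mol, J/(mol K), K, m^3 (illustrative values)
P = n * R * T / V                      # pressure in Pa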

Here we’re looking at an empirical model. We’re going to use a linear regression model to predict the price of a vehicle based on the data that we have.

df = df.drop(columns=['id','url', 'region_url', 'image_url', 'description','posting_date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 19 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        426880 non-null  object 
 1   price         426880 non-null  int64  
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   condition     252776 non-null  object 
 6   cylinders     249202 non-null  object 
 7   fuel          423867 non-null  object 
 8   odometer      422480 non-null  float64
 9   title_status  418638 non-null  object 
 10  transmission  424324 non-null  object 
 11  VIN           265838 non-null  object 
 12  drive         296313 non-null  object 
 13  size          120519 non-null  object 
 14  type          334022 non-null  object 
 15  paint_color   296677 non-null  object 
 16  state         426880 non-null  object 
 17  lat           420331 non-null  float64
 18  long          420331 non-null  float64
dtypes: float64(4), int64(1), object(14)
memory usage: 61.9+ MB
df['type'].value_counts().plot(kind='bar')
<Axes: xlabel='type'>
../_images/83ebef81b4283046d7c502a79590fd82e740655238f5da21b293e3aadaaec26a.png

Some of the columns are objects (strings) and some are numeric. We'll need to convert the objects to integers; that is done by encoding them. We'll use the LabelEncoder from the sklearn.preprocessing package to encode the object columns.

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for each in df.columns:
    if df[each].dtypes == 'object':
        df[each] = label_encoder.fit_transform(df[each].astype(str))
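To see what the encoder does, here's a toy example (the strings are illustrative, not taken from the dataset): each unique string is mapped to an integer, assigned in alphabetical order.

print(LabelEncoder().fit_transform(['gas', 'diesel', 'gas', 'electric']))  # -> [2 0 2 1]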

Let's check for outliers in the numeric columns.

df[['price','lat','long','odometer','year']].plot(kind='box', subplots=True, figsize=(12,3))
plt.show()
../_images/a86b28c98b035dd6237bf16b4a2b7ddea6f054fdb54d1a0325bea8cfc34e525c.png

Whoa, there are vehicles that have prices near 1e9?!! That's a billion dollars! Those are outliers. We'll keep only rows with a price below $40,000, and let's also only consider vehicles with a model year after 2000 (roughly the last two decades).

df = df[df['price'] < 40000]
df = df[df['year'] > 2000]
df[['price','lat','long','odometer','year']].plot(kind='box', subplots=True, figsize=(12,3))
plt.show()
../_images/b508f20f235bbe9b2b570ede1b7162e744a667f764d38e479d535c56cea2caa0.png

There are some rows that have missing values. We’ll remove those rows.

df = df.dropna()
df['price'].plot(kind='hist', bins=50)
<Axes: ylabel='Frequency'>
../_images/0092dc9b9f5fe6f6a3345444a8d0b9937dcb2f7854c6fcbc694069443272fc2f.png

What about prices that are less than 100? We’ll remove those rows as well.

df = df[df['price'] > 100]
df.isnull().sum()
region          0
price           0
year            0
manufacturer    0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
title_status    0
transmission    0
VIN             0
drive           0
size            0
type            0
paint_color     0
state           0
lat             0
long            0
dtype: int64

Those data don't look like they have many outliers. Let's look at the correlation matrix to see if there are any correlations between the columns. First we need to import seaborn, a data visualization library.

import seaborn as sns
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, vmin=0, vmax=0.2)
<Axes: >
../_images/a3efb4b438eaa507a21aff64e179a00b4b6cda5bfd9c2bdbdab067ffbd1e5f73.png

Regression and Creating a Model for Prediction#

# Let's make a parametric model using just a subset of the variables
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# First we choose our variables
X = df.drop(columns=['price'])
y = df['price']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
# Make predictions
y_pred = model.predict(X_test)
y_pred_train = model.predict(X_train)
# Evaluate the model
r2test = r2_score(y_test, y_pred)
r2train = r2_score(y_train,y_pred_train)
print(f'R^2 Score on Test Set: {r2test}')
print(f'R^2 Score on the Training Set: {r2train}')
R^2 Score on Test Set: 0.4742875688132292
R^2 Score on the Training Set: 0.47752401445626147

What about a MAPE value for the model? Let’s calculate it.
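MAPE, the mean absolute percentage error, is the average of the absolute percent differences between the true and predicted values,

$$ \mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| $$

which is exactly what the function below computes.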

def MAPE(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape = MAPE(y_test, y_pred)
print(f'Mean Absolute Percentage Error: {mape}')
Mean Absolute Percentage Error: 188.24081539924998
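As a cross-check, recent versions of scikit-learn also provide a built-in mean_absolute_percentage_error (it returns a fraction rather than a percent, so multiply by 100); it should agree with the hand-written function above.

from sklearn.metrics import mean_absolute_percentage_error
print(mean_absolute_percentage_error(y_test, y_pred) * 100)  # should match the value above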

Whoa, that is a very high MAPE value. Let's plot the predicted values against the actual values to see how the model is performing.

plt.scatter(y_train, y_pred_train)
plt.xlabel('True Price')
plt.ylabel('Predicted Price')
Text(0, 0.5, 'Predicted Price')
../_images/07ecd0b3b35f4fd03f6c50881af9806dc187ba20fd654d07e95f451b9572ea37.png

Let's try a different model. We can use a logistic regression model so that none of our predicted prices are negative. We'll first need to scale the price column to be between 0 and 1.

df['price'] = df['price']/max(df['price'])
# First we choose our variables
X = df.drop(columns=['price'])
y = df['price']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
import statsmodels.api as sm

# Use statsmodels so we also get p-values for each coefficient
Xsm = sm.add_constant(X_train)  # Add constant for intercept term
model2 = sm.Logit(y_train, Xsm)
result = model2.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.542361
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  price   No. Observations:               262886
Model:                          Logit   Df Residuals:                   262867
Method:                           MLE   Df Model:                           18
Date:                Sat, 19 Oct 2024   Pseudo R-squ.:                  0.1654
Time:                        23:44:46   Log-Likelihood:            -1.4258e+05
converged:                       True   LL-Null:                   -1.7084e+05
Covariance Type:            nonrobust   LLR p-value:                     0.000
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const         -243.3400      2.507    -97.072      0.000    -248.253    -238.427
region       -9.362e-05   3.66e-05     -2.560      0.010      -0.000   -2.19e-05
year             0.1206      0.001     97.102      0.000       0.118       0.123
manufacturer    -0.0015      0.000     -4.164      0.000      -0.002      -0.001
model         3.582e-06   5.61e-07      6.379      0.000    2.48e-06    4.68e-06
condition       -0.0144      0.003     -4.958      0.000      -0.020      -0.009
cylinders        0.1476      0.003     48.103      0.000       0.142       0.154
fuel            -0.1400      0.005    -28.739      0.000      -0.150      -0.130
odometer     -2.303e-06   9.11e-08    -25.275      0.000   -2.48e-06   -2.12e-06
title_status    -0.1027      0.005    -22.379      0.000      -0.112      -0.094
transmission     0.1986      0.005     43.671      0.000       0.190       0.208
VIN           -3.68e-06   1.16e-07    -31.839      0.000   -3.91e-06   -3.45e-06
drive           -0.0462      0.005    -10.040      0.000      -0.055      -0.037
size            -0.0405      0.005     -7.657      0.000      -0.051      -0.030
type             0.0060      0.001      5.523      0.000       0.004       0.008
paint_color      0.0072      0.001      6.500      0.000       0.005       0.009
state            0.0005      0.000      1.626      0.104      -0.000       0.001
lat             -0.0027      0.001     -3.436      0.001      -0.004      -0.001
long            -0.0032      0.000    -12.696      0.000      -0.004      -0.003
================================================================================
y_pred = result.predict(sm.add_constant(X_test))
plt.plot(y_test, y_pred, 'o')
[<matplotlib.lines.Line2D at 0x16fc91c70>]
../_images/602e7f0100e4e39559156d942d32eaf2840b0eb54dbed7ea5e25c43459fa3297.png
MAPE(y_test, y_pred)
185.7862129728562
r2_score(y_test, y_pred)
0.4933698705880425

Pandas Example with Vehicle Data#

Your vehicle monitors many different variables including the revolutions per minute, gas mileage, oxygen sensors, and catalytic converter temperature (unless you have an electric car) to name a few. You have access to that data as there’s an OBD II port likely below your steering wheel. I’ve collected some data from my 2019 Kia Forte that we’ll look at.

# read in the vehicle data (csv file) collected from the Kia's OBD II port
data = pd.read_csv('https://github.com/clint-bg/comptools/blob/main/lectures/supportfiles/kiadata.csv?raw=True')
# prepare data
data['time'] = pd.to_datetime(data['time']) # set time to datetime to help with plotting
data = data.set_index('time') #set index to time

# fill in NaNs as the data for each variable is collected at different times (and so places NaNs in the other columns)
# fill in NaNs - forward fill - fill in missing values with the previous value
data.fillna(method='ffill',inplace=True)
# fill in NaNs - backward fill
data.fillna(method='bfill',inplace=True)

# remove columns that match keywords
for dc in data.columns:
    if ("Average" in dc) or ("(total)" in dc) \
       or ("$" in dc) or ("(mA)" in dc) or ('Unnamed' in dc):
        del data[dc]
/var/folders/6d/1jr2w1qx1rnd2nkndlq4hc700000gn/T/ipykernel_36460/834267956.py:2: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  data['time'] = pd.to_datetime(data['time']) # set time to datetime to help with plotting
# print data columns
for x in data.columns:
    print(x)
# warm-ups since codes cleared ()
Absolute load value (%)
Absolute pedal position D (%)
Absolute pedal position E (%)
Absolute throttle position B (%)
Actual engine - percent torque (%)
Altitude (GPS) (feet)
Ambient air temperature (℉)
Barometric pressure (kPa)
Calculated boost (bar)
Calculated engine load value (%)
Calculated instant fuel consumption (MPG)
Calculated instant fuel rate (gal./h)
Catalyst temperature Bank 1 Sensor 1 (℉)
Catalyst temperature Bank 1 Sensor 2 (℉)
Commanded EGR duty (%)
Commanded evaporative purge (%)
Commanded throttle actuator (%)
Control module voltage (V)
Distance to empty (miles)
Distance traveled since codes cleared (miles)
Distance traveled with MIL on (miles)
Distance travelled (miles)
EGR error (%)
Engine coolant temperature (℉)
Engine Exhaust Flow Rate (g/sec)
Engine Friction - Percent Torque (%)
Engine Fuel Rate (g/sec)
Engine reference torque (N⋅m)
Engine RPM (rpm)
Engine RPM x1000 (rpm)
Evap. system vapor pressure (Pa)
Fuel economizer (based on fuel system status and throttle position) ()
Fuel level input (%) (%)
Fuel level input (V) (gallon)
Fuel used (gallon)
Fuel/Air commanded equivalence ratio ()
Instant engine power (based on fuel consumption) (hp)
Instant engine torque (based on fuel consumption) (N⋅m)
Intake air temperature (℉)
Intake manifold absolute pressure (kPa)
Long term fuel % trim - Bank 1 (%)
Long term secondary oxygen sensor trim Bank 1 (%)
MAF air flow rate (g/sec)
OBD Module Voltage (V)
Oxygen sensor 1 Wide Range Equivalence ratio ()
Oxygen sensor 2 Bank 1 Voltage (V)
Power from MAF (hp)
Relative throttle position (%)
Short term fuel % trim - Bank 1 (%)
Short term secondary oxygen sensor trim Bank 1 (%)
Speed (GPS) (mph)
Throttle position (%)
Timing advance (°)
Vehicle acceleration (g)
Vehicle Fuel Rate (g/sec)
Vehicle Odometer Reading (miles)
Vehicle speed (mph)
#rename columns
data.rename(columns={'Calculated instant fuel consumption (MPG)':'Calculated MPG'}, inplace=True)
#also cap the calculated MPG at a maximum of 100 mpg with a lambda function
data['Calculated MPG'] = data['Calculated MPG'].apply(lambda x: 100 if x>100 else x)
select = ['Vehicle speed (mph)','Throttle position (%)',\
          'Engine RPM (rpm)', 'Vehicle acceleration (g)','Calculated MPG']
data[select].plot(kind='box', subplots=True, figsize=(12,3))
plt.show()
../_images/37e58df45d27f5402cbd1df4768f0397029f533894ffc0dc380e899f8a7e9ed6.png
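As a side note, pandas' clip method is an equivalent (and arguably more idiomatic) way to cap the column than the lambda used above:

data['Calculated MPG'] = data['Calculated MPG'].clip(upper=100)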

Pair plot#

A pair plot can be helpful to see if any of the variables appear related.

import seaborn as sns
sns.pairplot(data[select])
plt.show()
../_images/4ce10db27837c7463064142c0f36a2990de3449162ae674b07c040030bbbec59.png

Multivariate Regression Example#

First Generate Data#

def tree_height(w,f):
    return 5*w**2 + (8*f) + np.random.rand(50)*0.1+5
#generate data
tree = dict(water=np.linspace(0,1,50),fertilizer=np.linspace(0,0.5,50))
tree['height'] = tree_height(tree['water'], tree['fertilizer'])
tree = pd.DataFrame(tree)
tree.head()
      water  fertilizer    height
0  0.000000    0.000000  5.028061
1  0.020408    0.010204  5.143994
2  0.040816    0.020408  5.269390
3  0.061224    0.030612  5.315699
4  0.081633    0.040816  5.372208
tree.describe()
           water  fertilizer     height
count  50.000000   50.000000  50.000000
mean    0.500000    0.250000   8.729158
std     0.297498    0.148749   2.708134
min     0.000000    0.000000   5.028061
25%     0.250000    0.125000   6.370010
50%     0.500000    0.250000   8.296953
75%     0.750000    0.375000  10.858869
max     1.000000    0.500000  14.052998
#mesh plot of the data
W, F = np.meshgrid(tree['water'],tree['fertilizer'])
H = tree_height(W,F)

nr, nc = H.shape
fig1, ax2 = plt.subplots(layout='constrained')
CS = ax2.contourf(W, F, H, 10, cmap=plt.cm.bone)

# Note that in the following, we explicitly pass in a subset of the contour
# levels used for the filled contours.  Alternatively, we could pass in
# additional levels to provide extra resolution, or leave out the *levels*
# keyword argument to use all of the original levels.

CS2 = ax2.contour(CS, levels=CS.levels[::2], colors='gray')

ax2.set_title('Tree Height')
ax2.set_xlabel('water')
ax2.set_ylabel('fertilizer')

# Make a colorbar for the ContourSet returned by the contourf call.
cbar = fig1.colorbar(CS)
cbar.ax.set_ylabel('height')
# Add the contour line levels to the colorbar
cbar.add_lines(CS2)
../_images/42b2b29a54e0d952040e4ccfdf8bf78650e48a26f8f6f551e8bfcf2c3709736a.png

Now complete the regression, first with one variable: fertilizer#

#Complete the same above task with the statsmodels package
import statsmodels.api as sm
X = tree['fertilizer']
X = sm.add_constant(X) # add a constant column so the fit includes an intercept
y = tree['height']
model1 = sm.OLS(y, X)
results1 = model1.fit()
print(results1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 height   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     2258.
Date:                Sat, 19 Oct 2024   Prob (F-statistic):           5.08e-42
Time:                        23:44:53   Log-Likelihood:                -23.456
No. Observations:                  50   AIC:                             50.91
Df Residuals:                      48   BIC:                             54.74
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.2253      0.110     38.408      0.000       4.004       4.446
fertilizer    18.0156      0.379     47.515      0.000      17.253      18.778
==============================================================================
Omnibus:                        6.395   Durbin-Watson:                   0.033
Prob(Omnibus):                  0.041   Jarque-Bera (JB):                4.746
Skew:                           0.623   Prob(JB):                       0.0932
Kurtosis:                       2.149   Cond. No.                         7.22
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Zp = []; Wf = W.flatten(); Ff = F.flatten()
for i,each in enumerate(Wf):
    Zp.append(results1.params[0] + results1.params[1]*Ff[i])
Zp = np.array(Zp).reshape(50,50)
#Calculate the MAPE
np.mean((np.abs(H.flatten() - Zp.flatten()))/H.flatten())*100
19.65289315935919
fig1, ax2 = plt.subplots(layout='constrained')
CS = ax2.contourf(W, F, Zp, 10, cmap=plt.cm.bone)

# Note that in the following, we explicitly pass in a subset of the contour
# levels used for the filled contours.  Alternatively, we could pass in
# additional levels to provide extra resolution, or leave out the *levels*
# keyword argument to use all of the original levels.

CS2 = ax2.contour(CS, levels=CS.levels[::2], colors='gray')

ax2.set_title('Tree Height')
ax2.set_xlabel('water')
ax2.set_ylabel('fertilizer')

# Make a colorbar for the ContourSet returned by the contourf call.
cbar = fig1.colorbar(CS)
cbar.ax.set_ylabel('height')
# Add the contour line levels to the colorbar
cbar.add_lines(CS2)
../_images/8c87f05688029671bb768b048386e4f8b162f6c9f5b00c37ed815440eed70eaa.png

Now complete the regression with two variables: fertilizer and water#

#Complete the same above task with the statsmodels package
import statsmodels.api as sm
X = tree[['water','fertilizer']]
X = sm.add_constant(X)
y = tree['height']
model2 = sm.OLS(y, X)
results2 = model2.fit()
print(results2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 height   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     2258.
Date:                Sat, 19 Oct 2024   Prob (F-statistic):           5.08e-42
Time:                        23:44:54   Log-Likelihood:                -23.456
No. Observations:                  50   AIC:                             50.91
Df Residuals:                      48   BIC:                             54.74
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.2253      0.110     38.408      0.000       4.004       4.446
water          7.2062      0.152     47.515      0.000       6.901       7.511
fertilizer     3.6031      0.076     47.515      0.000       3.451       3.756
==============================================================================
Omnibus:                        6.395   Durbin-Watson:                   0.033
Prob(Omnibus):                  0.041   Jarque-Bera (JB):                4.746
Skew:                           0.623   Prob(JB):                       0.0932
Kurtosis:                       2.149   Cond. No.                     1.11e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.45e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Zp2 = []
for i,each in enumerate(Wf):
    Zp2.append(results2.params[0]+results2.params[1]*Wf[i]+results2.params[2]*Ff[i])
Zp2 = np.array(Zp2).reshape(50,50)
fig1, ax2 = plt.subplots(layout='constrained')
CS = ax2.contourf(W, F, Zp2, 10, cmap=plt.cm.bone)

# Note that in the following, we explicitly pass in a subset of the contour
# levels used for the filled contours.  Alternatively, we could pass in
# additional levels to provide extra resolution, or leave out the *levels*
# keyword argument to use all of the original levels.

CS2 = ax2.contour(CS, levels=CS.levels[::2], colors='gray')

ax2.set_title('Tree Height')
ax2.set_xlabel('water')
ax2.set_ylabel('fertilizer')

# Make a colorbar for the ContourSet returned by the contourf call.
cbar = fig1.colorbar(CS)
cbar.ax.set_ylabel('height')
# Add the contour line levels to the colorbar
cbar.add_lines(CS2)
../_images/c05e45cee7099ec77a0cadf6e1eb46bbdf3bb1798cdb46796031e22dee62c368.png
#Calculate the MAPE
np.mean((np.abs(H.flatten() - Zp2.flatten()))/H.flatten())*100
9.913107089766699