# Conventional Methods – Multiple Linear Regression

## Definition

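Multiple linear regression builds a predictive model of a target y from two or more features x1, …, xn by fitting a linear equation:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

Here w0 is the intercept and w1, …, wn are the weights learned for each feature.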
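As a minimal sketch of what "fitting a linear equation" means, the snippet below builds noise-free synthetic data from known weights and recovers them with ordinary least squares using NumPy's `np.linalg.lstsq`. The weights, sample count, and variable names are made up for illustration and are not part of the tutorial's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth weights: w0 (intercept), w1, w2
true_w = np.array([2.0, 3.0, -1.0])

# 100 samples with two features each
X = rng.normal(size=(100, 2))

# Prepend a column of ones so that w0 acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Noise-free targets: y = w0 + w1*x1 + w2*x2
y = X_design @ true_w

# Solve for the weights that minimise the squared error
w_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(w_hat)
```

Because the targets contain no noise, the recovered weights match the generating ones up to floating-point error. With real data the fit is only the best linear approximation.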

## Problem with too many features

Too many features can make the model less accurate, for example by letting it fit noise in the training data (overfitting). It is therefore worth selecting features before training the model. One simple option is to remove features with low variance; scikit-learn provides this and other methods: https://scikit-learn.org/stable/modules/feature_selection.html
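As one illustration of the "other methods" on that page, the sketch below uses `SelectKBest` with the `f_regression` score to keep the two features most strongly related to the target. The data is synthetic and the choice of `k=2` is an assumption made for the example, not a recommendation from the tutorial.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic data: 5 features, but only features 0 and 3 influence the target
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Score each feature with a univariate F-test and keep the top k
selector = SelectKBest(score_func=f_regression, k=2)
X_reduced = selector.fit_transform(X_demo, y_demo)

# Column indices of the features that were kept
kept = selector.get_support(indices=True)
print(kept, X_reduced.shape)
```

Unlike a variance threshold, this method looks at the target, so in practice it should be fitted on the training split only.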

## Steps with code example

### Step 1: Data Preprocessing

#### Importing the libraries

```python
import pandas as pd
import numpy as np
```

#### Importing the dataset

```python
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values  # every column except the last is a feature
Y = dataset.iloc[:, 4].values    # column index 4 holds the target
```

#### Encoding Categorical data

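```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The categorical_features argument was removed from OneHotEncoder in
# scikit-learn 0.22; ColumnTransformer is used here instead.
# Column 3 is one-hot encoded (no LabelEncoder step is needed for strings);
# the remaining columns are passed through after the encoded ones.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [3])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)
```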

#### Avoiding Dummy Variable Trap

```python
# Drop the first dummy column so the remaining dummies are not collinear with the intercept
X = X[:, 1:]
```
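To see why one dummy column is dropped, the sketch below (with a made-up three-category variable) compares the rank of the design matrix with and without the first dummy. With all dummies present, the columns sum to the intercept column, so the matrix is rank-deficient and the coefficients are not uniquely determined.

```python
import numpy as np

# Hypothetical category codes for six samples and three categories
codes = np.array([0, 1, 2, 0, 1, 2])

# Full one-hot encoding, shape (6, 3): each row has exactly one 1
dummies = np.eye(3)[codes]
intercept = np.ones((6, 1))

# Intercept plus all three dummies: 4 columns
full = np.hstack([intercept, dummies])
# Intercept plus two dummies (first one dropped): 3 columns
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```

The full matrix has more columns than its rank, while the reduced matrix has full column rank. The dropped category becomes the baseline that the other dummy coefficients are measured against.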

#### Feature selection

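```python
from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance is below 0.8 * (1 - 0.8) = 0.16
selector = VarianceThreshold(threshold=(.8 * (1 - .8)))
# Keep the transformed array; fit_transform does not modify X in place
sel_X = selector.fit_transform(X)
```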

#### Splitting the dataset into the Training set and Test set

```python
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(sel_X, Y, test_size=0.2, random_state=0)
```

### Step 2: Fitting Multiple Linear Regression to the Training set

```python
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
```

### Step 3: Predicting the Test set results

```python
y_pred = regressor.predict(X_test)
```
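A common follow-up, not covered in the steps above, is to compare the predictions with the held-out targets and to inspect the learned weights. The sketch below does this on synthetic stand-in data, since `50_Startups.csv` is not reproduced here. The data, the generating weights, and the choice of R² and mean squared error as metrics are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 50 samples, 3 features, known weights plus small noise
X_demo = rng.normal(size=(50, 3))
Y_demo = 1.5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0)

regressor = LinearRegression().fit(X_train, Y_train)
y_pred = regressor.predict(X_test)

# Agreement between predictions and held-out targets
r2 = r2_score(Y_test, y_pred)
mse = mean_squared_error(Y_test, y_pred)
print(r2, mse)

# intercept_ corresponds to w0, coef_ to w1..wn in the linear equation
print(regressor.intercept_, regressor.coef_)
```

On real data the scores will usually be lower than on this nearly noise-free example, and the metric should be chosen to suit the problem.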