The aim is to build a model that predicts sales from the money spent on marketing across different platforms (TV, radio, and newspaper), using simple linear regression and multiple linear regression.

Data Source: Kaggle.com

Python Libraries used

  1. Pandas

  2. NumPy

  3. Matplotlib

  4. Seaborn

  5. from sklearn.model_selection import train_test_split

  6. from sklearn import metrics

Pre-processing operations

  1. Checking for missing values

  2. Checking for duplicate values

  3. Checking for outliers/extreme values
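
The three pre-processing checks can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the column names (TV, Radio, Newspaper, Sales), the tiny made-up table, and the 1.5 * IQR outlier rule are all assumptions chosen for the example.

```python
import pandas as pd

# Made-up rows standing in for the real data; column names are assumed.
# The fifth row deliberately repeats the second one.
df = pd.DataFrame({
    "TV": [100.0, 50.0, 200.0, 150.0, 50.0, 250.0],
    "Radio": [20.0, 10.0, 30.0, 25.0, 10.0, 40.0],
    "Newspaper": [30.0, 15.0, 45.0, 20.0, 15.0, 60.0],
    "Sales": [12.5, 8.0, 18.0, 15.0, 8.0, 21.5],
})

# 1. Missing values: number of NaN entries in each column.
missing_per_column = df.isnull().sum()

# 2. Duplicate values: rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())


def iqr_outlier_count(series, k=1.5):
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())


# 3. Outliers/extreme values: one count per column (IQR rule is an assumption).
outliers_per_column = {col: iqr_outlier_count(df[col]) for col in df.columns}

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(outliers_per_column)
```

A box plot per column is a common visual alternative to the IQR count for spotting extreme values.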

Exploratory Data Analysis

  1. Distribution of the target variable

  2. How sales relates to the independent variables

  3. Correlation between the variables
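
As a rough sketch of the numeric side of these three steps, pandas can summarize the target and compute the correlations directly. The column names and the small made-up table are assumptions for illustration; the real analysis would run on the full dataset.

```python
import pandas as pd

# Made-up rows in the assumed column layout (illustrative values only).
df = pd.DataFrame({
    "TV": [100.0, 50.0, 200.0, 150.0, 250.0],
    "Radio": [20.0, 10.0, 30.0, 25.0, 40.0],
    "Newspaper": [30.0, 15.0, 45.0, 20.0, 60.0],
    "Sales": [12.5, 8.0, 18.0, 15.0, 21.5],
})

# 1. Distribution of the target: count, mean, spread, and quartiles.
sales_summary = df["Sales"].describe()

# 3. Pairwise Pearson correlation between all variables.
corr = df.corr()

# 2. How strongly each predictor moves with sales, strongest first.
sales_corr = corr["Sales"].drop("Sales").sort_values(ascending=False)

print(sales_summary)
print(sales_corr)
```

For the visual counterparts, Seaborn's `sns.histplot(df["Sales"])` and `sns.heatmap(corr, annot=True)` would plot the same information.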

Model Building

Prediction using:

  1. Simple Linear Regression

  2. Multiple Linear Regression

I set the stage for the sales prediction model by importing the necessary libraries and reading the dataset from a CSV file. The dataset contains the amounts spent on TV, radio, and newspaper advertising and the corresponding sales figures. I then check the dataset for missing values and find none, so no imputation or row removal is needed before modeling.
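
A minimal sketch of this loading step is shown here. The file name `advertising.csv` and the column names are hypothetical; to keep the example runnable without the file, an in-memory CSV with made-up rows stands in for it.

```python
import io

import pandas as pd

# Stand-in for the real file: made-up rows in the assumed column layout.
csv_text = """TV,Radio,Newspaper,Sales
100.0,20.0,30.0,12.5
50.0,10.0,15.0,8.0
200.0,30.0,45.0,18.0
150.0,25.0,20.0,15.0
"""

# With the real dataset this would be pd.read_csv("advertising.csv")
# (hypothetical file name).
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape

# True if any cell in the table is NaN.
has_missing = bool(df.isnull().values.any())

print(df.head())
print(f"{n_rows} rows, {n_cols} columns, missing values present: {has_missing}")
```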

I'm using linear regression, a foundational statistical method for predicting a quantitative response. I start with simple linear regression, which relates two variables with a straight line. The formula is Y = β0 + β1X + ε, where Y is the dependent variable, β0 is the y-intercept of the regression line, β1 is the slope of the line, X is the independent variable, and ε is the error term. Using the scikit-learn library in Python, I set up a simple linear regression model that predicts sales from the TV advertising budget. After training the model, I display the coefficients, which give the prediction equation Sales = 6.948 + 0.054 * TV. Each additional unit of TV spend is therefore associated with a 0.054 increase in predicted sales; for example, a TV budget of 100 (in the dataset's units) gives 6.948 + 0.054 * 100 = 12.348.
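
The fitting step might look like the sketch below. It is not the project's code: the data is synthetic, generated from an arbitrary line (intercept 7.0, slope 0.05) plus noise so that the example runs on its own, and the 80/20 split and `random_state` values are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TV and sales columns (arbitrary true line + noise).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.0, size=200)

# scikit-learn expects a 2-D feature matrix, even for a single predictor.
X = tv.reshape(-1, 1)
y = sales

# Hold out 20% of the rows for testing (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

intercept = float(model.intercept_)  # estimate of β0
slope = float(model.coef_[0])        # estimate of β1

print(f"Sales = {intercept:.3f} + {slope:.3f} * TV")
```

Because the data here is synthetic, the printed coefficients land near 7.0 and 0.05 rather than the 6.948 and 0.054 obtained on the real dataset.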

I then move from simple to multiple linear regression, which uses more than one predictor variable to predict a response. Multiple linear regression is useful because it models the relationship between a single continuous dependent variable and several independent variables. The formula generalizes to Y = β0 + β1X1 + β2X2 + ... + βnXn + ε, where X1, X2, ..., Xn are the independent variables. After training a multiple linear regression model with TV, radio, and newspaper as predictors of sales, I display the model coefficients, which are the weights assigned to each predictor. Predictions for the test set are obtained by plugging these coefficients, together with each test row's predictor values, into the regression formula.
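
A sketch of the multiple regression step, under stated assumptions: the data is synthetic (arbitrary true weights, with newspaper given zero effect purely for illustration), the column names are assumed, and RMSE and R² are just two of the measures available in `sklearn.metrics`. It also shows that plugging the coefficients into the formula by hand reproduces `model.predict`.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (arbitrary true weights + noise).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
df["Sales"] = 3.0 + 0.045 * df["TV"] + 0.19 * df["Radio"] + rng.normal(0, 1.0, n)

features = ["TV", "Radio", "Newspaper"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Sales"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# One weight per predictor, in the same order as `features`.
coefficients = dict(zip(features, model.coef_))

# Library prediction versus the formula β0 + β1*X1 + β2*X2 + β3*X3 by hand.
y_pred = model.predict(X_test)
manual_pred = model.intercept_ + X_test.to_numpy() @ model.coef_

# Error measures on the held-out test set.
rmse = float(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
r2 = float(metrics.r2_score(y_test, y_pred))

print("intercept:", model.intercept_)
print("coefficients:", coefficients)
print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```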