September 25, 2023

Topics Learnt Today:

1: Bootstrap: 

  • Bootstrap is a resampling technique used for statistical inference, such as estimating the sampling distribution of a statistic or constructing confidence intervals.
  • It involves repeatedly sampling from the dataset with replacement to create multiple bootstrap samples, each of the same size as the original dataset.
  • The statistic of interest (e.g., a mean, median, or regression coefficient) is calculated for each bootstrap sample, and the distribution of these statistics is used to make inferences (for example, to estimate the statistic’s standard error).
  • Bootstrap can also be applied to estimate prediction error by resampling the dataset and calculating the error metric (e.g., mean squared error) for each resampled dataset.
  • Advantages:
    • Provides an empirical estimate of the sampling distribution and can be used to construct confidence intervals.
    • Useful for making inferences about population parameters and assessing the stability of statistical estimates.
  • Disadvantages:
    • Does not directly provide model evaluation or performance estimation, unlike cross-validation.
    • May not be as straightforward for model selection or hyperparameter tuning compared to cross-validation.
  • In summary, the bootstrap is used for statistical inference, constructing confidence intervals, and estimating population parameters.
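A minimal sketch of the idea in Python (NumPy only; the synthetic exponential sample and the 95% percentile interval are assumptions chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)   # synthetic skewed sample (illustrative)

n_boot = 5000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # resample with replacement, same size as the original dataset
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = sample.mean()

# the empirical sampling distribution of the mean gives an SE and a percentile CI
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap SE of the mean: {boot_means.std(ddof=1):.3f}")
print(f"95% percentile CI: ({ci_low:.3f}, {ci_high:.3f})")
```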

2: Training Error vs Test Error:

Training error and test error are two key concepts in machine learning and model evaluation. They provide insights into how well a machine learning model is performing during training and how it is likely to perform on unseen data. Understanding the differences between these two types of errors is essential for assessing a model’s generalization capability and identifying issues like overfitting or underfitting.

Training Error:

  • Training error, also known as in-sample error, is the error or loss that a machine learning model incurs on the same dataset that was used to train it.
  • When you train a model, it learns to fit the training data as closely as possible. The training error measures how well the model fits the training data.
  • A model that has a low training error is said to have a good fit to the training data. However, a low training error does not necessarily indicate that the model will generalize well to new, unseen data.
  • Training error tends to be overly optimistic because the model has already seen the training data and has adapted to it, potentially capturing noise and specific patterns that may not generalize to other datasets.

Test Error:

  • Test error, also known as out-of-sample error or validation error, is the error or loss that a machine learning model incurs on a dataset that it has never seen during training. This dataset is called the validation or test set.
  • The test error provides an estimate of how well the model is likely to perform on new, unseen data. It helps assess the model’s generalization capability.
  • A model with a low test error is expected to make accurate predictions on new data, indicating good generalization.
  • Test error is a more reliable indicator of a model’s performance on real-world data because it measures how well the model can generalize beyond the training data.
  • In summary, training error measures how well a model fits the training data, while test error provides an estimate of a model’s performance on new, unseen data.
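A hedged sketch of the gap between the two errors, using scikit-learn on synthetic data (the sine-shaped data and the degree-12 polynomial are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# a deliberately flexible model: high-degree polynomial regression
model = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
model.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"training MSE: {train_mse:.3f}")   # in-sample, usually optimistic
print(f"test MSE:     {test_mse:.3f}")    # estimate of generalization error
```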

3: Validation-set Approach: 

The validation-set approach is a technique used in machine learning and statistical model evaluation to assess a model’s performance and tune its hyperparameters. It’s particularly useful when you have a limited amount of data and want to estimate how well your model will generalize to new, unseen data. Here’s how the validation-set approach works:

  1. Data Splitting: The first step is to divide your dataset into three distinct subsets: a training set, a validation set, and a test set. The typical split ratios are 60-70% for training, 15-20% for validation, and 15-20% for testing, but these ratios can vary depending on the size of your dataset.
    • Training Set: This subset is used to train the machine learning model. The model learns patterns and relationships in the data from this set.
    • Validation Set: The validation set is used for hyperparameter tuning and model selection. It serves as an independent dataset to evaluate the model’s performance under various hyperparameter settings.
    • Test Set: The test set is a completely independent dataset that the model has never seen during training or hyperparameter tuning. It is used to provide an unbiased estimate of the model’s generalization performance.
  2. Model Training and Hyperparameter Tuning: With the training set, you train the machine learning model using various hyperparameter settings. The goal is to find the set of hyperparameters that yields the best performance on the validation set. This process often involves iteratively adjusting hyperparameters and evaluating the model on the validation set until satisfactory performance is achieved.
  3. Model Evaluation: After hyperparameter tuning is complete, you have a final model with the best hyperparameters. You then evaluate this model’s performance on the test set. The test set provides an unbiased estimate of how well the model is likely to perform on new, unseen data.
  4. Performance Metrics: You can use various evaluation metrics, depending on the type of problem you’re addressing. Common metrics include accuracy, precision, recall, F1 score for classification problems, and mean squared error (MSE), root mean squared error (RMSE), or R-squared for regression problems.
  5. Iterative Process: It’s important to note that the validation-set approach can involve an iterative process of model training, hyperparameter tuning, and evaluation. This process helps ensure that the model is well-tuned and performs optimally on unseen data.
  6. Caution: While the validation-set approach is a valuable technique for model evaluation and hyperparameter tuning, it’s essential to avoid data leakage. Data leakage occurs when information from the validation set or test set unintentionally influences the model training process. Ensure that you use the validation set only for tuning hyperparameters and the test set only for final evaluation.
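A minimal sketch of the train/validation/test workflow with scikit-learn; the 60/20/20 split, the Ridge model, and the candidate alpha values are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=500)

# 60% training, 20% validation, 20% test (one possible split)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=2)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=2)

# tune a hyperparameter (Ridge alpha) using the validation set only
best_alpha, best_val_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_alpha, best_val_mse = alpha, val_mse

# final, unbiased evaluation on the untouched test set
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("best alpha:", best_alpha)
print("test MSE:", mean_squared_error(y_test, final_model.predict(X_test)))
```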

September 22, 2023

Topics Learnt Today:

1: Polynomial Regression: Polynomial regression is a type of regression analysis used when the relationship between the independent variable(s) and the dependent variable is not linear but can be approximated by a polynomial function.

  • Polynomial regression allows for modeling non-linear relationships between variables by introducing higher-order terms (e.g., X², X³) into the regression equation.
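A minimal sketch, assuming a quadratic relationship and using NumPy’s polyfit for the least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=x.size)  # quadratic + noise

# fit y = b0 + b1*x + b2*x^2 by least squares
coefs = np.polyfit(x, y, deg=2)        # returned highest degree first
y_hat = np.polyval(coefs, x)
print("fitted coefficients (x^2, x, intercept):", np.round(coefs, 3))
```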

2: Logistic Regression: 

  • Logistic regression is a type of regression analysis used for predicting binary or categorical outcomes. It’s used when the dependent variable is binary (e.g., 0/1, Yes/No, True/False).
  • The logistic regression model uses the logistic (sigmoid) function to map the linear combination of independent variables to the probability of belonging to one of the categories.
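A minimal sketch with scikit-learn’s LogisticRegression on a hypothetical two-feature classification rule, showing that the predicted probability is the sigmoid of the linear combination:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
# hypothetical true rule: class 1 when 2*x1 - x2 + noise > 0
y = (2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# the sigmoid maps the linear combination to a probability in (0, 1)
z = clf.decision_function(X[:5])          # linear combination w·x + b
p_manual = 1 / (1 + np.exp(-z))           # sigmoid applied by hand
p_sklearn = clf.predict_proba(X[:5])[:, 1]
print(np.allclose(p_manual, p_sklearn))   # True: same probabilities
```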

3: Step Function: 

  • A step function, also known as a Heaviside step function or a unit step function, is a mathematical function that returns a constant value (usually 0 or 1) depending on whether its argument is greater than or equal to a threshold value.
  • It’s often used in engineering and physics to model discontinuous changes or events. In binary classification problems, it’s sometimes used to represent binary outcomes.
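A minimal sketch of a unit step with a configurable threshold (the threshold value is an arbitrary choice), alongside NumPy’s built-in heaviside:

```python
import numpy as np

def unit_step(x, threshold=0.0):
    """Heaviside-style step: 0 below the threshold, 1 at or above it."""
    return np.where(x >= threshold, 1, 0)

x = np.array([-2.0, -0.1, 0.0, 0.5, 3.0])
print(unit_step(x))          # [0 0 1 1 1]
print(np.heaviside(x, 1))    # NumPy built-in; second argument is the value taken at 0
```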

4: State Function:

  • A state function, in the context of state-space models, represents the internal state of a dynamic system. State-space models are commonly used in control theory, engineering, and various scientific fields.
  • State-space models describe a system using two equations: the state equation (describing how the internal state evolves over time) and the measurement equation (relating the internal state to observed measurements).
  • In control theory, state functions are used to represent variables such as position, velocity, and acceleration, and they are essential for designing control systems.
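A minimal sketch of the two equations for an assumed discrete-time constant-velocity system (the matrices, noise scales, and time step are purely illustrative):

```python
import numpy as np

# state x = [position, velocity]; measurement y = noisy position
dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])      # state equation:       x_{k+1} = A x_k + w_k
C = np.array([[1.0, 0.0]])      # measurement equation: y_k     = C x_k + v_k

rng = np.random.default_rng(5)
x = np.array([0.0, 1.0])        # start at position 0 with velocity 1
for k in range(5):
    x = A @ x + rng.normal(scale=0.01, size=2)   # process noise w_k
    y = C @ x + rng.normal(scale=0.05)           # measurement noise v_k
    print(f"step {k}: state={np.round(x, 3)}, measurement={np.round(y, 3)}")
```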

September 20, 2023

Topics Learnt Today:
1: Predictive Model: 

A predictive model is a mathematical or computational representation of a real-world system or phenomenon that is used to make predictions or forecasts about future events or outcomes based on historical data and patterns. Predictive models are a fundamental component of machine learning, data analysis, and statistics, and they find applications in various fields, including finance, healthcare, marketing, and more.

Here are some key aspects and components of predictive models:

  1. Data Collection: Predictive models require historical data to learn from. This data typically includes information about the system being modeled and the outcomes of interest. Data collection can involve various sources, such as sensors, databases, surveys, or web scraping.
  2. Features: Features, also known as predictors or independent variables, are the variables or attributes from the data that the model uses to make predictions. Feature selection and engineering are critical steps in model development to choose the most relevant and informative features.
  3. Target Variable: The target variable, also known as the dependent variable, is the variable the model aims to predict. It represents the outcome or event of interest. For example, in a credit scoring model, the target variable might be whether a person will default on a loan or not.
  4. Model Selection: Choosing an appropriate predictive model is a crucial step. The choice of model depends on the nature of the data (e.g., regression for continuous outcomes, classification for categorical outcomes) and the specific problem being addressed. Common models include linear regression, decision trees, random forests, support vector machines, and neural networks, among others.
  5. Training: Training a predictive model involves using historical data to teach the model how to make predictions. During training, the model learns the relationships between the features and the target variable. The goal is to minimize prediction errors on the training data.
  6. Validation and Testing: After training, the model’s performance is evaluated using validation and testing datasets. Validation helps tune hyperparameters and assess model performance during development, while testing provides an estimate of how well the model will perform on new, unseen data.
  7. Evaluation Metrics: Various evaluation metrics are used to assess the quality of predictions made by the model. Common metrics include accuracy, precision, recall, F1 score, mean squared error (MSE), and root mean squared error (RMSE), depending on the type of problem (classification or regression).
  8. Deployment: Once a predictive model has been trained and tested, it can be deployed in a real-world application. Deployment involves integrating the model into a software system or process to make automated predictions on new data.
  9. Monitoring and Maintenance: Predictive models may require ongoing monitoring and maintenance to ensure they continue to provide accurate predictions. Data drift, changes in the distribution of data, and shifts in the underlying relationships can impact a model’s performance over time.
  10. Retraining: Periodic retraining of the model with updated data is often necessary to maintain its predictive accuracy. Models can become stale if not regularly refreshed with new information.
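A compressed sketch of steps 4–8 on synthetic data; the random-forest choice, the metrics, and the pickle file name are illustrative assumptions rather than a recommended pipeline:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# synthetic stand-in for collected historical data (features + binary target)
X, y = make_classification(n_samples=1000, n_features=10, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6)

# model selection + training
model = RandomForestClassifier(n_estimators=200, random_state=6).fit(X_train, y_train)

# evaluation on held-out data
pred = model.predict(X_test)
print("accuracy:", round(accuracy_score(y_test, pred), 3))
print("F1 score:", round(f1_score(y_test, pred), 3))

# "deployment": persist the trained model so an application can load and reuse it
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```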

 

2: Chi Square Regression: Chi-square regression, also known as Poisson regression or log-linear regression, is a statistical regression model used for analyzing count data or frequency data, where the dependent variable represents counts or occurrences of an event in a fixed unit of observation. This type of regression is particularly suitable when the assumptions of linear regression, such as normally distributed residuals, are not met, and the data exhibit a Poisson or count distribution.

Applications of chi-square regression include analyzing data from fields such as epidemiology (e.g., disease incidence), social sciences (e.g., survey responses), and manufacturing (e.g., defect counts). It is especially useful when dealing with data that exhibit a count distribution, and it provides a way to model and interpret relationships between predictors and counts while accounting for the inherent nature of the data.
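A minimal sketch of a count-data (Poisson / log-linear) regression with statsmodels; the simulated predictor and coefficients are assumptions for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
predictor = rng.normal(size=n)                 # hypothetical explanatory variable
rate = np.exp(0.3 + 0.8 * predictor)           # log-linear mean structure
counts = rng.poisson(rate)                     # count outcome

X = sm.add_constant(predictor)
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.summary())   # coefficients are on the log scale: exp(beta) is a rate ratio
```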

September 18, 2023

Topics Learnt Today:

Overfitting is a common problem in machine learning and statistical modeling. It occurs when a model learns the training data too well, capturing noise or random fluctuations in the data rather than the underlying patterns or relationships. As a result, an overfit model performs very well on the training data but poorly on the unseen or new data that a predictive model is ultimately meant to handle. Overfitting reflects high model variance: the model is too complex for the given data.

Key characteristics and causes of overfitting:

  1. Complex Models: Models that are excessively complex, with too many parameters or features, are prone to overfitting. They have the capacity to fit noise in the data, which leads to poor generalization.
  2. Small Dataset: Overfitting is more likely to occur when you have a small dataset because there is not enough data to capture the true underlying patterns. With limited data, the model may fit the noise instead.
  3. High Model Flexibility: Models with high flexibility or capacity, such as deep neural networks or decision trees with many branches, are susceptible to overfitting. They can adapt too closely to the training data.
  4. Lack of Regularization: Regularization techniques like L1 and L2 regularization or dropout in neural networks are used to control overfitting by adding constraints to the model’s parameters. If these techniques are not used when necessary, overfitting can occur.
  5. Noise in Data: If the training data contains noise or errors, the model might try to fit that noise, leading to overfitting. Clean and well-preprocessed data is important to reduce the risk of overfitting.
  6. Feature Engineering: Including too many irrelevant or redundant features in the model can contribute to overfitting. Feature selection or dimensionality reduction techniques can help mitigate this issue.
  7. Early Stopping: In the training of iterative models (e.g., neural networks), if you train for too many epochs, the model may start overfitting. Early stopping, which involves monitoring the model’s performance on a validation set and stopping training when performance starts to degrade, can help prevent this.
  8. Cross-Validation: Not using cross-validation to assess model performance can leave overfitting undetected. Cross-validation helps in estimating how well a model will generalize to unseen data.
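To make the points about model complexity and L2 regularization concrete, here is a hedged sketch on a small synthetic dataset (the degree-15 polynomial and alpha=1 are arbitrary illustrative choices); the regularized model typically shows a smaller gap between training and test error:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(60, 1))                   # deliberately small dataset
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=8)

for name, reg in [("no regularization", LinearRegression()),
                  ("L2 (Ridge, alpha=1)", Ridge(alpha=1.0))]:
    # very flexible polynomial features make overfitting easy to see
    model = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), reg)
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    te = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name:22s} train MSE={tr:.3f}  test MSE={te:.3f}")
```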

Cross-validation is a technique used in machine learning and statistics to assess the performance and generalization ability of a predictive model. Its primary purpose is to estimate how well a model will perform on unseen data, which helps in avoiding overfitting (a model that fits the training data too closely but performs poorly on new data) and provides a more accurate evaluation of a model’s capabilities.

Here’s how cross-validation works:

  1. Data Splitting: The first step is to divide the available dataset into two or more subsets: typically, a training set and a testing (or validation) set. The training set is used to train the machine learning model, while the testing set is used to evaluate its performance.
  2. K-Fold Cross-Validation: The most common form of cross-validation is k-fold cross-validation, where the dataset is divided into ‘k’ subsets of approximately equal size. The model is trained and evaluated ‘k’ times, using a different subset as the validation set in each iteration. For example, in 5-fold cross-validation, the dataset is divided into 5 subsets, and the model is trained and tested five times, with each subset serving as the validation set once.
  3. Performance Metrics: In each fold or iteration, the model’s performance is measured using evaluation metrics such as accuracy, precision, recall, F1 score, or mean squared error, depending on the type of problem (classification or regression) you’re solving.
  4. Average Performance: After all k iterations are complete, the performance metrics are averaged across the k folds to obtain a single evaluation score. This score provides an estimate of how well the model is likely to perform on new, unseen data.
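A minimal 5-fold sketch with scikit-learn on synthetic regression data (the fold count, model, and MSE metric are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=9)

cv = KFold(n_splits=5, shuffle=True, random_state=9)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_mean_squared_error")

# one MSE per fold, then averaged into a single evaluation score
print("per-fold MSE:", np.round(-scores, 2))
print("mean MSE across folds:", round(-scores.mean(), 2))
```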

Advantages of Cross-Validation:

  1. Robustness: It provides a more robust estimate of a model’s performance because it uses multiple validation sets rather than just one.
  2. Avoiding Overfitting: Cross-validation helps in detecting overfitting because the model is evaluated on different data subsets. If the model performs well across all folds, it’s more likely to generalize well to new data.
  3. Optimal Parameter Tuning: Cross-validation is often used for hyperparameter tuning, allowing you to choose the best set of hyperparameters for your model.

 

September 15, 2023

Topics Learnt Today:

Multi Linear Regression

Multiple linear regression (MLR) is a statistical technique for modelling a dependent variable using two or more independent variables, often known as predictors or explanatory variables. It is an extension of simple linear regression, which takes into account just one independent variable. MLR aims to analyse and quantify the relationships between the independent variables and the dependent variable.

1: Coefficient Interpretation: The coefficients (β1, β2, β3, etc.) show how strongly and in which direction each independent variable is related to the dependent variable. For instance, if β1 is positive, it implies that, holding all other variables constant, an increase in X1 is associated with an increase in Y.

2: Intercept: The intercept (β0) reflects the estimated value of the dependent variable when all independent variables are zero. Depending on the context of your data, this value might not always have a meaningful interpretation.

3: Assumptions: Multiple linear regression assumes homoscedasticity, normality, and independence of the residuals (the discrepancies between the actual values of Y and the values predicted by the model). Violations of these assumptions may affect the reliability of the regression results.

4: Model Evaluation: To evaluate the goodness of fit of the model and ascertain if it sufficiently explains the variability in the dependent variable, a variety of statistical approaches, including hypothesis testing, R-squared, and adjusted R-squared, can be utilised.

5: Multicollinearity: This phenomenon happens when there is a strong correlation between two or more independent variables in a model. As a result, figuring out the unique contributions of each variable might be difficult. Multicollinearity can be found and addressed using techniques like the variance inflation factor (VIF).
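A minimal sketch with statsmodels covering coefficients, the intercept, R-squared, and VIF-based multicollinearity screening; the simulated predictors (including a deliberately correlated x3) are assumptions for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
n = 300
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["x3"] = 0.8 * df["x1"] + rng.normal(scale=0.3, size=n)   # correlated with x1
y = 2.0 + 1.5 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=n)

X = sm.add_constant(df)
model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, intercept, R-squared, adjusted R-squared

# variance inflation factors to screen for multicollinearity
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 2))
```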

September 13, 2023

1: p-value – The p-value is a statistical measure that is commonly used in hypothesis testing to assess the strength of evidence against a null hypothesis.

  • Small p-value (typically ≤ α): Strong evidence against the null hypothesis. Researchers may conclude that there is a significant effect or difference, supporting the alternative hypothesis.
  • Large p-value (typically > α): Weak evidence against the null hypothesis. Researchers do not have enough evidence to reject the null hypothesis.

2: Breusch-Pagan test – Significance of p-value in this test is as follows:

  • If p-value ≤ α: This indicates that there is strong evidence to reject the null hypothesis. In other words, you conclude that there is heteroscedasticity in the regression model, suggesting that the variance of the error term is not constant across the levels of the independent variables.
  • If p-value > α: This suggests that there is not enough evidence to reject the null hypothesis. In this case, you would conclude that there is no significant heteroscedasticity in the regression model, and it is reasonable to assume that the variance of the error term is constant across the levels of the independent variables.
  • In summary, the p-value in the Breusch-Pagan test helps you assess whether there is heteroscedasticity in your regression model. If the p-value is low (typically less than 0.05), you conclude that there is evidence of heteroscedasticity, which can have implications for the validity of your regression analysis. If the p-value is high, you do not have strong evidence to suggest heteroscedasticity, and you can proceed with more confidence in the assumptions of homoscedasticity.

3: Chi-Square Distribution – As the degrees of freedom increase, the chi-square distribution becomes more bell-shaped and approaches a normal distribution (the central limit theorem applies). Chi-square distributions are an essential tool in statistical analysis, particularly for drawing inferences about population variances, testing hypotheses, and assessing relationships between categorical variables.
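A minimal sketch of the Breusch-Pagan test from point 2 with statsmodels (its LM statistic is referred to a chi-square distribution), on data simulated to be heteroscedastic by construction:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(11)
x = rng.uniform(1, 10, size=300)
# error variance grows with x, so the data are heteroscedastic by construction
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print("LM p-value:", round(lm_pvalue, 4))
# p-value <= 0.05 -> reject the null of constant variance (heteroscedasticity)
```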

September 11, 2023

Topics learnt in today’s class: (Simple Linear Regression)

1: Skewness – The term “skewness” in simple linear regression describes the asymmetry of the residuals’ distribution, which might have an effect on the reliability of our regression model and how we should interpret its findings. To verify the accuracy of our regression analysis, it’s critical to look for skewness in the residuals and, if necessary, take the proper remedial action.

2: Kurtosis – In simple linear regression, kurtosis describes how the residuals are distributed and can reveal whether they have heavier or lighter tails than a normal distribution. It is important to consider kurtosis when analysing regression data, since severe kurtosis can affect the validity of our regression results and necessitate corrective actions to ensure the accuracy of our analysis. For a normal distribution the kurtosis is 3, but in the diabetes dataset the residual kurtosis is about 4, so the distribution is not exactly normal.
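A minimal sketch of checking residual skewness and kurtosis with SciPy; the heavier-tailed t-distributed errors are simulated purely for illustration (this is not the diabetes dataset):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(12)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.standard_t(df=4, size=200)   # heavier-tailed errors

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

print("skewness:", round(skew(resid), 3))
# fisher=False reports "raw" kurtosis, where a normal distribution equals 3
print("kurtosis:", round(kurtosis(resid, fisher=False), 3))
```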

3: Heteroscedasticity – Heteroscedasticity occurs when the variance of the residuals (the differences between the observed values of the dependent variable and the values predicted by the regression model) is not constant. In other words, the spread of the residuals changes as you move along the values of the independent variable. Heteroscedasticity can lead to incorrect inferences about the statistical significance of the regression coefficients. In particular, standard errors may be under- or over-estimated, which affects the accuracy of parameter estimates and can result in inaccurate assessments of the significance of predictors. When heteroscedasticity is present, least squares estimates may no longer be the most efficient estimators of the regression coefficients, and this inefficient estimation can reduce the statistical power of our analysis.