December 8, 2023

Linear Regression:

Model Evaluation of Train Data:

Scatter Plot of Predicted vs Actual Values:
• A scatter plot with the x-axis representing the actual values (y_train) and the y-axis representing the predicted values (y_pred).
• The closer the points are to the diagonal line (not explicitly shown but implied), the better the model’s predictions match the actual data.
• The points seem to align well along an increasing diagonal line, suggesting a good fit between the model’s predictions and actual values, especially as the total jobs number increases.
Distribution of Residuals:
• The second image is a histogram overlaid with a kernel density estimate that shows the distribution of the model’s residuals, which are the differences between the actual values and the predicted values.
• Ideally, the residuals should be normally distributed around zero, indicating that the model’s predictions are unbiased.
• The distribution looks approximately normal and centered around zero, which is a good sign, although there seems to be a slight right skew.

Model Evaluation of Validation Data:

Scatter Plot of Predicted vs Actual Values:

This plot compares the actual values (y_val) on the x-axis with the predicted values on the y-axis. Ideally, if the predictions were perfect, all points would lie on the diagonal line where the predicted values equal the actual values. The scatter shows that the model’s predictions are reasonably close to the actual values, although there is some variance, especially in the middle range of the actual values.
Distribution of Residuals:
The residuals are the differences between the actual and predicted values. This histogram shows the distribution of these residuals, with a superimposed kernel density estimate (KDE). The residuals seem to be approximately normally distributed, with a mean close to zero. This is a good sign, indicating that the model does not systematically overpredict or underpredict the total number of jobs. However, there is a noticeable spread, suggesting that there are predictions that are significantly off from the actual values, which is also reflected in the scatter plot.
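
A minimal sketch of how these two diagnostics (predicted-vs-actual scatter and residual histogram with a KDE) could be drawn with matplotlib and seaborn. The function name and the synthetic arrays are illustrative stand-ins for the project’s y_val / y_pred, not the actual project code:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def evaluation_plots(y_true, y_pred):
    """Plot predicted vs. actual values and the residual distribution."""
    residuals = y_true - y_pred

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Scatter of actual vs. predicted, with the ideal 45-degree reference line.
    ax1.scatter(y_true, y_pred, alpha=0.6)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    ax1.plot(lims, lims, linestyle="--", color="red")
    ax1.set_xlabel("Actual total jobs")
    ax1.set_ylabel("Predicted total jobs")
    ax1.set_title("Predicted vs. Actual")

    # Histogram + KDE of residuals; ideally centered on zero.
    sns.histplot(residuals, kde=True, ax=ax2)
    ax2.axvline(0, color="red", linestyle="--")
    ax2.set_xlabel("Residual (actual - predicted)")
    ax2.set_title("Distribution of Residuals")

    plt.tight_layout()
    plt.show()

# Example usage with synthetic numbers standing in for the real data:
rng = np.random.default_rng(0)
y_true = rng.normal(350_000, 20_000, 60)
y_pred = y_true + rng.normal(0, 5_000, 60)
evaluation_plots(y_true, y_pred)
```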

December 6, 2023

The pairplot shows the relationships and distributions of different economic indicators within clusters determined by KMeans clustering. Each scatter plot’s axes represent two indicators, while the density plots on the diagonal show the distribution of single variables, colored by cluster.

Analysis:
• Clusters are formed based on the inherent groupings in the multidimensional data. Each cluster represents a grouping of data points that are similar to each other.
• The scatter plots demonstrate how these clusters are distributed with respect to two indicators at a time.
• The distribution plots (on the diagonal) indicate the range and density of each variable within each cluster, with some variables showing distinct peaks for different clusters.

From the clusters, we can infer that certain combinations of economic indicators occur together frequently, suggesting potential correlations or influences between these indicators.
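
A rough sketch of how such a cluster-colored pairplot could be produced with scikit-learn and seaborn, assuming a DataFrame df holding the indicator columns named in this journal; the choice of five clusters and the scaling step are assumptions, not a record of the project’s exact settings:

```python
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Indicators mentioned in the journal; the DataFrame name `df` is assumed.
cols = ["logan_passengers", "logan_intl_flights",
        "hotel_occup_rate", "hotel_avg_daily_rate"]

def cluster_pairplot(df: pd.DataFrame, n_clusters: int = 5):
    """Fit KMeans on scaled indicators and draw a pairplot colored by cluster."""
    X = StandardScaler().fit_transform(df[cols])   # scale so no indicator dominates
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X)
    plot_df = df[cols].assign(cluster=labels)
    # Scatter plots off the diagonal, per-cluster KDEs on the diagonal.
    return sns.pairplot(plot_df, hue="cluster", diag_kind="kde")
```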

December 4, 2023

Total Jobs vs Logan Passengers:
There is a positive relationship between the number of passengers at Logan Airport and the total number of jobs.

The R-Squared value is approximately 0.729, suggesting that about 72.9% of the variability in total jobs can be explained by the number of Logan passengers.

The p-value is extremely low (approximately 3.57 × 10^(-15)), indicating a statistically significant relationship.

Total Jobs vs Logan International Flights:
Similarly, the number of international flights has a positive correlation with the total number of jobs.

The R-Squared value is 0.764, meaning that approximately 76.4% of the variability in total jobs is accounted for by the number of international flights.

The p-value is very small (around 3.04 × 10^(-17)), which implies a statistically significant relationship.

Total Jobs vs Hotel Occupancy Rate:
The relationship between hotel occupancy rates and total jobs is weaker compared to the previous two variables.

The R-Squared value is about 0.142, indicating that only 14.2% of the variability in total jobs is explained
by the hotel occupancy rate.

The p-value is approximately 0.197, which is above the typical significance level of 0.05, suggesting that the relationship might not be statistically significant.

Total Jobs vs Hotel Average Daily Rate:

There is a moderate positive relationship between the average daily rate of hotels and total jobs.

The R-Squared value is 0.313, which means that about 31.3% of the variability in total jobs can be explained by the hotel average daily rate.

The p-value is approximately 0.0038, indicating a statistically significant relationship at common significance levels.

Total Jobs vs Unemployment Rate:

There is a strong negative relationship between the unemployment rate and the total number of jobs, which is intuitive as higher unemployment would typically be associated with fewer jobs.

The R-Squared value is about 0.872, suggesting that 87.2% of the variability in total jobs can be explained by the unemployment rate.

The p-value is extremely low (around 4.10 × 10^(-27)), indicating a very strong statistically significant relationship.
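
A hedged sketch of how each of these simple regressions could be computed with scipy.stats.linregress, assuming a monthly DataFrame df; the column names follow the journal, while unemp_rate (the unemployment-rate column) is an assumed name:

```python
import pandas as pd
from scipy import stats

# Column names follow the journal; `df` is an assumed monthly DataFrame.
predictors = ["logan_passengers", "logan_intl_flights", "hotel_occup_rate",
              "hotel_avg_daily_rate", "unemp_rate"]

def simple_regressions(df: pd.DataFrame, target: str = "total_jobs"):
    """Regress the target on each predictor separately; report R^2 and p-value."""
    for col in predictors:
        res = stats.linregress(df[col], df[target])
        print(f"{col:>22}: R^2 = {res.rvalue ** 2:.3f}, "
              f"p = {res.pvalue:.2e}, slope = {res.slope:.2f}")
```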

December 1, 2023

Box Plots:

  • Hotel Occupancy Rate: The median hotel occupancy rate in the Boston area is between 72.5% and 77.5%. The spread of the data is smaller than that of the number of passengers and international flights, with most months having an occupancy rate between 65% and 85%.
  • Hotel Avg Daily Rate: The median hotel average daily rate in the Boston area is between $240 and $265 per night. The spread of the data is larger than that of the hotel occupancy rate, with some months having average rates as low as $200 and others as high as $300.
  • Hotel occupancy rates are also relatively stable, but there is a wider range of possible rates.
  • Hotel average daily rates vary more significantly than any of the other variables.

November 29, 2023

Box Plots:

  • Logan Passengers: The median number of passengers at Logan International Airport is between 2.75 and 3.25 million per month. There is a significant spread in the data, with some months having as few as 1.5 million passengers and others having as many as 5 million.
  • Logan Intl Flights: The median number of international flights at Logan International Airport is between 3,250 and 3,750 per month. The spread of the data is similar to that of the number of passengers, with some months having as few as 2,500 flights and others having as many as 5,250.
  • The number of passengers and international flights at Logan International Airport is relatively stable throughout the year, with only slight variations from month to month.
  • The data suggests that the tourism industry in the Boston area is relatively stable and predictable.

November 27, 2023

Histograms:

Histogram of Hotel Occupancy Rate:

  • The most common hotel occupancy rate in the Boston area is between 70% and 80%.
  • Hotel occupancy rates in the Boston area have been declining slightly in recent years, but remain relatively high.

Histogram of Hotel Avg Daily Rate:

  • The most common hotel average daily rate in the Boston area is between $225 and $275 per night.
  • Hotel average daily rates in the Boston area have been increasing steadily over time.

November 24, 2023

Histograms:

Histogram of Logan Passengers:

  • The most common number of passengers at Logan International Airport is between 2.5 and 3.5 million per month.
  • The number of passengers at Logan International Airport has been increasing steadily over time, with a peak of over 5 million passengers in December 2023.

Histogram of Logan Intl Flights:

  • The most common number of international flights at Logan International Airport is between 3000 and 4000 per month.
  • The number of international flights at Logan International Airport has also been increasing steadily over time, with a peak of over 5000 flights in December 2023.

November 22, 2023

Pair Plots for Relationship between Variables:

1: There is a positive correlation between logan_intl_flights and logan_passengers. This means that as the number of international flights at Logan International Airport increases, the number of passengers at the airport also tends to increase. This is likely because Logan is a major hub for both domestic and international flights.

2: There is a positive correlation between logan_passengers and hotel_avg_daily_rate. This means that as the number of passengers at Logan International Airport increases, the average daily rate of hotels in the area also tends to increase. This is likely because an increase in demand for hotel rooms drives up prices.

3: There is a positive correlation between logan_passengers and hotel_occup_rate. This means that as the number of passengers at Logan International Airport increases, the occupancy rate of hotels in the area also tends to increase. This is likely due to the same reason as the previous point: an increase in demand for hotel rooms drives up prices and occupancy rates.

In addition to the above inferences, the scatter plots also reveal some interesting trends:

1: The relationship between logan_passengers and hotel_avg_daily_rate appears stronger than the relationship between logan_passengers and hotel_occup_rate. This suggests that room prices respond more strongly to rising passenger demand than occupancy does.

2: The relationship between all three variables appears to be linear. This means that the change in one variable is proportional to the change in the other variables.

November 20, 2023

Correlation Matrix:

1:Logan International Airport (BOS) has the highest number of passengers and hotel occupancy rates. This is likely because BOS is a major hub for both domestic and international flights.

2:Hotel occupancy rates are generally higher than the number of passengers. This suggests that many people who travel to the United States are staying in hotels, even if they are not arriving or departing through BOS.

3: There is a positive correlation between the number of passengers and hotel occupancy rates. This means that as the number of passengers increases, hotel occupancy rates also tend to increase.

Trends:

1: The number of passengers at BOS has been increasing steadily over time. This suggests that the airport is becoming more popular with travelers.

2: Hotel occupancy rates at BOS have been declining slightly in recent years. This could be due to several factors, such as the rise of Airbnb and other home-sharing platforms.

3: The correlation between the number of passengers and hotel occupancy rates has been weakening in recent years. This suggests that other factors, such as the economy and the availability of alternative accommodations, are becoming more influential in determining hotel occupancy rates.
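
As a sketch, the correlation matrix behind these observations could be computed and visualized as follows; the DataFrame df and the exact column subset are assumptions:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def correlation_heatmap(df):
    """Compute pairwise Pearson correlations and show them as an annotated heatmap."""
    corr = df[["logan_passengers", "logan_intl_flights",
               "hotel_occup_rate", "hotel_avg_daily_rate"]].corr()
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation matrix of tourism and hotel indicators")
    plt.show()
```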

November 17, 2023

Decided to work on Hotel Market effects on Tourism:         

1: logan_passengers:

  • The mean number of Logan airport passengers is approximately 3,015,647.
  • The standard deviation is around 549,276, indicating some variability in the number of passengers.
  • The minimum and maximum values are roughly 1,878,731 and 4,120,937, respectively.

2: logan_intl_flights:

  • The mean number of international flights is approximately 3,940.51.
  • The standard deviation is approximately 694.48.
  • The minimum and maximum values are 2,587 and 5,260, respectively.

3: hotel_occup_rate:

  • The mean hotel occupancy rate is approximately 81.77%.
  • The standard deviation is about 10.86%.
  • The minimum and maximum values are 57.2% and 93.1%, respectively.

4: hotel_avg_daily_rate:

  • The mean hotel average daily rate is approximately $244.42.
  • The standard deviation is around $49.76.
  • The minimum and maximum values are $157.89 and $337.92, respectively.

Interpretations:

1: Logan Passengers and International Flights:

  • The mean values provide a central tendency for the number of Logan airport passengers and international flights.
  • The standard deviations indicate the variability around these means.

2: Hotel Occupancy Rate:

  • The mean hotel occupancy rate of approximately 81.77% suggests a relatively high average occupancy.
  • The variability (standard deviation) of around 10.86% indicates some fluctuations in hotel occupancy.

3: Hotel Average Daily Rate:

  • The mean hotel average daily rate of approximately $244.42 provides an average pricing benchmark.
  • The standard deviation of $49.76 suggests some variability in hotel pricing.
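
The summary statistics quoted above correspond to what pandas’ describe() reports; a minimal sketch, assuming the monthly data has been loaded into a DataFrame (the CSV file name in the usage comment is hypothetical):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return mean, std, min, and max for the tourism and hotel columns."""
    cols = ["logan_passengers", "logan_intl_flights",
            "hotel_occup_rate", "hotel_avg_daily_rate"]
    return df[cols].describe().T[["mean", "std", "min", "max"]].round(2)

# Hypothetical usage:
# df = pd.read_csv("boston_economic_indicators.csv")
# print(summarize(df))
```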

November 15, 2023

Real Estate: Board Approved Development Projects (Pipeline):

  • pipeline_unit (Units): Approximately 468.95 units are approved on average for development projects.
  • The data may exhibit an anomaly since the minimum is negative.
  • The average total development cost, or pipeline_total_dev_cost, is approximately $480,481,700.
  • The cost ranges from $0 at the minimum to $2,755,500,000 at the maximum.
  • sqft (pipeline, in square feet): The approved projects have an average square footage of about 992,537.
  • The range of the square footage is 0 to 4,714,445.
  • pipeline_const_jobs (Construction Jobs): For projects that are approved, the average number of construction jobs is roughly 801.73. 3,976 is the maximum, and 0 is the minimum.

Real Estate Market: Housing:

  • foreclosure_pet (Foreclosure Petitions): On average, about 13.23 foreclosure petitions are filed per reporting period, with a range of 0 to 69.
  • foreclosure_deeds (Foreclosure Deeds): The mean quantity of foreclosure deeds is approximately 3.77, with a range of 0 to 17.
  • Med_housing_price (Median Housing Sales Price): The median price of a home sold averages about $167,327.85. In some cases, the median price is reported as 0.
  • housing_sales_vol (Volume of Housing Sales): Approximately 269.61 houses are sold on average. The range is 0 to 2,508.
  • New Housing Construction Permits: The mean quantity of permits issued for new housing construction is approximately 132.89. The range is 0 to 897.
  • new-affordable_housing_permits (New Affordable Housing Unit Permits): The average number of permits for new affordable housing construction is approximately 23.13. The range is from 0 to 232.

November 13, 2023

The dataset seems to encompass a number of topics pertaining to Boston’s real estate, labor market, hotel industry, and tourism sector.
This dataset offers an extensive perspective of diverse economic metrics in Boston, facilitating the examination and investigation of patterns and connections among diverse industries.

Key statistics for various variables over 84 monthly observations in Boston are summarized in the data that is provided.

Month and Year:

  • The information is available from 2013 to 2019.
  • The mean month value of about 6.5 simply reflects that calendar months 1 through 12 are averaged over the observation period.

    Travel:

  • logan_passengers (Passenger Traffic at Logan): 3.02 million people travel through Logan Airport on average.
  • There is a minimum of roughly 1.88 million and a maximum of roughly 4.12 million.
  • Logan International Flights (logan_intl_flights): There are roughly 3940.51 international flights on average.
  • 2587 is the minimum and 5260 is the maximum.

    Hotel Market:

  • hotel_occup_rate (Occupancy Rate): 81.77% is the average hotel occupancy rate.
  • 93.1% is the highest rate, and 57.2% is the lowest.
  • The average daily rate, or hotel_avg_daily_rate: The average cost of a hotel room is $244.42 per day.
  • $157.89 is the minimum rate and $337.92 is the maximum.

November 10, 2023

Topics Learnt Today:
Clustering methods for the project:
The Silhouette Scores, which serve as indicators of clustering quality, have been calculated for different clustering algorithms, each applied with five clusters. Detailed explanation of each method is below:

1: KMedoids Clustering (n_clusters=5):

  • Silhouette Score: 0.37
  • Interpretation: The score of 0.37 suggests moderate cohesion and separation between clusters. Points are reasonably well matched to their own clusters but only moderately separated from neighboring clusters. The clusters are somewhat distinguishable, but the separation is not exceptionally strong.

2: KMeans Clustering (n_clusters=5):

  • Silhouette Score: 0.44
  • Interpretation: The higher score of 0.44 indicates better cohesion and separation between clusters. Points are well matched to their own clusters and better separated from neighboring clusters, signifying a more distinct and well-defined clustering than KMedoids. The clusters are relatively well-separated.

3: DBSCAN Clustering (eps=0.5, min_samples=5):

  • Silhouette Score: -1
  • Interpretation: The negative score of -1 is concerning. It suggests potential issues with the clustering quality, indicating that the DBSCAN algorithm may not be suitable for the given data and parameter settings. A negative silhouette score implies that points are inappropriately assigned to clusters, and the algorithm struggles to define meaningful clusters with the specified parameters.

In summary, the Silhouette Scores provide insights into the performance of different clustering algorithms. KMeans exhibits the highest score (0.44), indicating more distinct and well-separated clusters compared to KMedoids and DBSCAN. The negative score for DBSCAN suggests challenges in forming meaningful clusters with the specified parameters, highlighting potential issues in the clustering process for this algorithm in the given context.
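
A sketch of how these three scores could be computed side by side; KMedoids here comes from the scikit-learn-extra package, and the scaling step, random seeds, and feature matrix X are assumptions rather than the project’s exact setup:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn_extra.cluster import KMedoids  # provided by scikit-learn-extra

def compare_clusterers(X: np.ndarray) -> dict:
    """Return silhouette scores for KMedoids, KMeans, and DBSCAN on scaled data."""
    X = StandardScaler().fit_transform(X)
    scores = {}
    for name, model in [
        ("KMedoids", KMedoids(n_clusters=5, random_state=42)),
        ("KMeans", KMeans(n_clusters=5, n_init=10, random_state=42)),
        ("DBSCAN", DBSCAN(eps=0.5, min_samples=5)),
    ]:
        labels = model.fit_predict(X)
        # The silhouette is undefined with fewer than 2 clusters (e.g., if DBSCAN
        # labels everything as noise), so report -1 as a sentinel in that case.
        if len(set(labels)) < 2:
            scores[name] = -1.0
        else:
            scores[name] = silhouette_score(X, labels)
    return scores
```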

November 8, 2023

Topics Learnt Today:

The provided boxplot illustrates the age distribution of individuals who were killed, categorized by their race, denoted by letters (A, W, H, B, O, N) likely corresponding to Asian, White, Hispanic, Black, Other, and Native American. Here’s a descriptive analysis of the boxplot:

Asian (A): The median age is approximately in the mid-30s, and there is a relatively symmetrical spread of ages within the interquartile range (IQR) from the mid-20s to mid-40s. Numerous outliers suggest a significant number of cases with ages deviating from the central tendency, spanning from young adults to those in their late 60s or early 70s.

White (W): The median age is similar to that of the Asian category, in the mid-30s, but the IQR has a broader spread from the early 20s to late 40s. Outliers indicate individuals outside the typical age range, both younger and notably older, with a cluster of older-age outliers.

Hispanic (H): The median age is slightly lower than that of Asian and White categories, potentially in the early 30s. The age distribution is compact, with an IQR similar to the Asian category. There are outliers on the higher age end, but fewer than in the White category.

Black (B): The median age for this group is also in the early 30s, with a tight IQR, indicating less variability in age within the quartiles compared to the White category. Outliers are present, indicating ages both much younger and older than the median.

Other (O): The median age in this category seems to be in the early 30s, with an IQR comparable to that of the Hispanic and Black categories. There are a few outliers, suggesting the presence of individuals significantly older than the median.

Native American (N): The median age for Native Americans is similar to that of the Other category, with an IQR slightly wider but comparable to other minority groups. Outliers indicate ages higher than the typical range.

Overall, the median ages across the races do not vary significantly, with most medians lying in the 30s. White individuals exhibit a broader age range with older-age outliers, whereas other racial categories have tighter age distributions with fewer outliers.

November 6, 2023

Topics Learnt Today:
1: White People

The age distribution of White individuals in the dataset displays a moderately right-skewed pattern, featuring a median of 38.0 and a mean of 40.09. The data indicates a relatively widespread distribution, as illustrated by a standard deviation of 13.24 and a variance of 175.26, suggesting a considerable degree of variability in the ages of White individuals.

The positive skewness value of 0.52 provides further insight into the distribution’s characteristics. Skewness measures the asymmetry of a distribution, and in this context, a positive skewness of 0.52 indicates a tail on the right side. This implies that there are relatively more White individuals with ages higher than the median, contributing to the rightward skew.

Furthermore, the negative kurtosis value of -0.13 sheds light on the tails and overall shape of the distribution. Kurtosis measures the tail heaviness of a distribution, and a negative kurtosis of -0.13 suggests slightly lighter tails compared to a normal distribution. This suggests that the age distribution among White individuals has tails that are less pronounced, and the overall shape of the distribution is somewhat flatter at the peak compared to a normal distribution.

2: Other Age Groups

The age distribution of individuals categorized as “Other” in the dataset is characterized by a median of 31.0 and a mean of 33.47. The standard deviation (11.48) and variance (131.83) suggest a moderate degree of variability in the dataset.

The positive skewness value of 0.63 provides additional information about the distribution’s shape. Skewness measures the asymmetry of a distribution, and in this case, a positive skewness of 0.63 indicates a right-skewed distribution with a tail on the right side. This suggests that there are relatively more individuals in the “Other” category with ages higher than the median, contributing to the rightward skew.

The negative kurtosis value of -0.23 gives insight into the tails and overall shape of the distribution. Kurtosis measures the tail heaviness of a distribution, and a negative kurtosis of -0.23 implies slightly lighter tails compared to a normal distribution. Additionally, the negative kurtosis suggests a flatter peak, indicating that the distribution among individuals categorized as “Other” is less concentrated around the mean compared to a normal distribution.
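
The per-group statistics quoted in these entries can be reproduced along the following lines; the file name and the race code used in the usage comment are illustrative assumptions:

```python
import pandas as pd
from scipy.stats import kurtosis, skew

def age_distribution_stats(ages: pd.Series) -> dict:
    """Summary statistics of the kind reported above for one group's ages."""
    return {
        "median": ages.median(),
        "mean": ages.mean(),
        "std": ages.std(),
        "variance": ages.var(),
        "skewness": skew(ages, nan_policy="omit"),
        # Fisher's definition (normal distribution => 0), matching the signs above.
        "kurtosis": kurtosis(ages, nan_policy="omit"),
    }

# Hypothetical usage on the shootings dataset:
# df = pd.read_csv("fatal-police-shootings-data.csv")
# print(age_distribution_stats(df.loc[df["race"] == "W", "age"]))
```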

November 3, 2023

Topics Learnt Today:
1: Hispanics

The age distribution of Hispanic individuals in the dataset demonstrates a moderately right-skewed pattern, as indicated by a median of 33.0 and a mean of 33.73. The data exhibits a relatively lower level of dispersion, evident through a standard deviation of 10.59 and a variance of 112.13, suggesting that there is less variability in the distribution of ages among Hispanic individuals.

The positive skewness value of 0.77 adds more detail to the distribution. Skewness measures the asymmetry of a distribution, and in this context, a positive skewness of 0.77 indicates a tail on the right side. This suggests that there are relatively more Hispanic individuals with ages higher than the median, contributing to the rightward skew.

Furthermore, the positive kurtosis value of 0.69 provides insight into the tails and overall shape of the distribution. Kurtosis measures the tail heaviness of a distribution, and a positive kurtosis of 0.69 suggests slightly heavier tails compared to a normal distribution. This implies that the age distribution among Hispanic individuals has tails that are more pronounced, and the overall shape of the distribution is somewhat more concentrated around the mean.

2: Native Americans:

The age distribution of Native American individuals in the dataset is characterized by a moderately right-skewed pattern, with a median of 32.0 and a mean of 32.92. The data exhibits a relatively narrow spread, as evidenced by a standard deviation of 9.38 and a variance of 87.92, indicating a lesser degree of variability in the distribution of ages among Native American individuals.

The positive skewness value of 0.50 provides additional insight into the distribution. Skewness measures the asymmetry of a distribution, and in this context, a positive skewness of 0.50 suggests a tail on the right side. This implies that there are relatively more Native American individuals with ages higher than the median, contributing to the rightward skew.

Moreover, the negative kurtosis value of -0.17 offers information about the tails and overall shape of the distribution. Kurtosis measures the tail heaviness of a distribution, and a negative kurtosis of -0.17 suggests slightly lighter tails compared to a normal distribution. This indicates that the age distribution among Native American individuals has tails that are less pronounced, and the overall shape of the distribution is somewhat flatter at the peak compared to a normal distribution.

November 1, 2023

Topics Learnt Today:

1: Asians:

The distribution of ages among Asians in the dataset is roughly symmetric, with a median of 35.0 and a mean of 36.48. The dataset displays a moderate level of dispersion, as indicated by a standard deviation of 12.21. The variance, calculated as about 149, signifies a notable degree of variability in the age distribution. A skewness of 0.26 indicates a slight rightward tail, suggesting a minor asymmetry towards higher age values. Additionally, the kurtosis value of -0.79 implies that the distribution has lighter tails than a normal distribution, indicating a relatively flatter, less peaked shape.

2: Blacks

The age distribution of Black individuals in the dataset exhibits a right-skewed pattern, as reflected by a median of 31.0 and a mean of 32.74. The dataset’s dispersion is of moderate extent, as evidenced by a standard deviation of 11.34 and a variance of 128.62, indicating variability in the distribution of ages among Black individuals.

The positive skewness value of 1.01 further characterizes the distribution. Skewness measures the asymmetry of a distribution. In this context, a positive skewness of 1.01 suggests a tail extending towards higher age values. This implies that there are relatively more Black individuals with ages higher than the median, contributing to the rightward skew.

Moreover, the positive kurtosis value of 0.99 is indicative of the distribution having heavier tails and a more peaked shape compared to a normal distribution. Kurtosis measures the tail heaviness of a distribution. In this case, a positive kurtosis suggests that the tails of the age distribution among Black individuals are more pronounced than those in a normal distribution, and the overall shape of the distribution is more concentrated around the mean.

October 30, 2023

Topics Learnt Today:

  • White Mean Age: The White group’s mean age is roughly 40.09 years old.
  • Black Mean Age: The Black population is roughly 32.74 years old on average.
  • Mean Difference: The two groups’ mean ages differ by an absolute 7.35 years (Mean Age White – Mean Age Black).
  • T-Statistic: This expresses how many standard errors the observed mean difference is from zero. The T-Statistic in this instance is 18.46, suggesting a highly significant difference in the mean ages of the two groups.
  • P-Value: Based on the assumption that there is no difference between the groups (the null hypothesis), this is the likelihood of obtaining a T-Statistic as extreme as the one observed. Strong evidence against the null hypothesis is shown by a very low P-Value of 1.797e-72, or 1.797 x 10^(-72). Stated differently, it seems improbable that the observed variation in mean ages is the result of pure chance.
  • More or Equal Simulated Mean Differences: This relates to a simulation-based check. In hypothesis testing, simulations are sometimes used to generate a distribution of test statistics under the assumption that the null hypothesis is true. Since this value is 0, none of the simulated mean differences were as large as or larger than the observed mean difference.
  • Total Simulations: The total number of simulations run, in this example 2,000,000.

In conclusion, based on the low P-Value and the large T-Statistic, there is strong evidence to reject the null hypothesis that there is no difference in the mean ages between the White and Black groups. The information indicates a statistically significant difference in the two groups’ mean ages, with the White group’s mean age being greater than the Black group’s.
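
A hedged sketch of the comparison described above: a Welch two-sample t-test plus a permutation-style simulation that counts how often a shuffled mean difference is at least as extreme as the observed one. The journal reports 2,000,000 simulations; the default below is smaller so the plain Python loop stays quick, and the group arrays are assumed inputs:

```python
import numpy as np
from scipy import stats

def compare_mean_ages(white_ages, black_ages, n_sims=10_000, seed=0):
    """Welch t-test plus a permutation check of the observed mean age difference."""
    white_ages = np.asarray(white_ages, dtype=float)
    black_ages = np.asarray(black_ages, dtype=float)
    observed = white_ages.mean() - black_ages.mean()
    t_stat, p_value = stats.ttest_ind(white_ages, black_ages, equal_var=False)

    # Permutation simulation: shuffle the pooled ages and count how often the
    # resulting mean difference is at least as extreme as the observed one.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([white_ages, black_ages])
    n_white = len(white_ages)
    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        diff = pooled[:n_white].mean() - pooled[n_white:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, t_stat, p_value, extreme, n_sims
```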

October 27, 2023

Topics Learnt Today:

1: One-Tailed Test:

  • For the scenario:  Average age in State A is higher than the national average age.
  • Null Hypothesis (H0): State A’s average age is either the same as or less than the average age of the country.
  • The alternative hypothesis (H1) is that State A’s average age is significantly higher than the national average age.
  • This one-tailed (right-tailed) test checks whether the average age in State A is statistically higher than the national average.

2: Two-Tailed Test:

  • For the scenario: Number of fatal police shootings during different months of the year in a city within the US.
  • The null hypothesis (H0) is that the number of shooting occurrences in a given month equals the average monthly count.
  • The alternative hypothesis (H1) is that the number of shooting occurrences in at least one month differs significantly from the average monthly count.
  • This two-tailed test detects a substantial deviation from the norm regardless of whether the monthly count is higher or lower than the average.

October 23, 2023

Topics Learnt Today:

K-medoids:

  • Strengths:
    K-medoids, also known as PAM (Partitioning Around Medoids), is a more robust alternative to K-means. It identifies clusters based on representative points called medoids. These medoids are actual data points within the dataset, making them more suitable for handling irregularly shaped clusters.
    Unlike K-means, K-medoids is less sensitive to outliers, making it a better choice when dealing with data containing noise or extreme values.
  • Weaknesses:
    While K-medoids is more versatile than K-means in terms of cluster shape, it can still struggle with very large datasets due to its computational complexity.

October 20, 2023

Topics Learnt Today:

1: K-Means:

  • Strengths:
    K-means excels at identifying clusters of data points that are close to each other spatially. It operates by partitioning data into a predetermined number of clusters, with each cluster having a centroid point. The algorithm groups data points based on their proximity to these centroids. It is particularly effective when dealing with clusters of relatively uniform size and shape, making it suitable for situations where clusters are spherical or have similar geometries.
  • Weaknesses:
    K-means struggles with irregularly shaped clusters, as it assumes that clusters are spherical and uniform in size. When clusters are elongated or have varying sizes, K-means may produce suboptimal results, leading to the mixing of data points between clusters.

2: DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

  • Strengths:
    DBSCAN excels at finding clusters of varying shapes and sizes. It identifies clusters based on the density of data points, allowing it to uncover clusters of different geometries.
    DBSCAN is particularly adept at identifying outliers within the dataset, as it doesn’t force data points into clusters if they don’t meet density criteria.
  • Weaknesses:
    DBSCAN may require careful parameter tuning, such as the radius of the neighborhood around each point, to yield optimal results. In some cases, inappropriate parameter choices can lead to under-segmentation or over-segmentation of the data.

 

October 18, 2023

Topics Learnt today:

1: Exploratory Data Analysis:

  • The dataset, from The Washington Post data repository, focuses specifically on fatal police shootings in the United States.
  • This exploration involved an in-depth analysis of the dataset’s structure, its columns, and the types of data they contained. The primary goal was to gain a clear understanding of how the dataset was organized.
  • Missing data and outliers within the dataset were identified and addressed to ensure accurate and reliable analysis.

2: Data Cleaning:

  • In the data cleaning process, I focused mainly on the variables age, race, and flee. Missing data in these variables was imputed where suitable, ensuring that incomplete records were handled appropriately.
  • Outliers were handled to prevent extreme values from negatively impacting the results of the analysis. By addressing missing data and outliers, the dataset’s quality was enhanced, making it more suitable for accurate and reliable analysis.

October 16, 2023

Topics Learnt Today:

1: Clustering algorithms are employed to group similar incidents, such as fatal police shootings, based on their spatial proximity.

  • Clustering refers to the process of identifying groups of data points that are located close to each other in space.
  • The purpose of clustering in this scenario is to detect patterns or trends in the spatial distribution of these incidents.
  • It shows us that there are specific geographic areas with a higher incidence of fatal police shootings compared to others.
  • Clustering can provide valuable insights into the spatial relationships between incidents, potentially aiding law enforcement agencies, policymakers, and researchers in understanding the underlying factors contributing to the occurrences.

2: The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is mentioned as a specific method for clustering the data related to fatal police shootings.

  • DBSCAN is a popular clustering algorithm that can identify clusters of data points in a dataset while also recognizing and labeling noise points. By applying DBSCAN to the data with carefully chosen parameters, it will group data points into clusters based on their spatial proximity.
  • Data points that do not belong to any cluster are considered noise points. This process assists in discerning geographic regions with a higher density of incidents and isolating areas with fewer occurrences.
  • DBSCAN is known for its ability to handle clusters of various shapes and sizes, making it suitable for spatial analysis of incidents like fatal police shootings. The algorithm is valuable for identifying patterns and trends in the data and can provide a deeper understanding of the spatial distribution of these incidents.
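
A minimal sketch of density-based clustering on incident coordinates. The haversine metric, the 50 km eps, and the input handling are assumptions chosen for illustration; the project’s actual parameters are not recorded here:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_incidents(lat, lon, eps_km=50.0, min_samples=5):
    """Cluster incident coordinates with DBSCAN using the haversine metric.

    Points labeled -1 are treated as noise.
    """
    earth_radius_km = 6371.0
    coords = np.radians(np.column_stack([lat, lon]))  # haversine expects radians
    db = DBSCAN(eps=eps_km / earth_radius_km,          # convert km to radians
                min_samples=min_samples,
                metric="haversine",
                algorithm="ball_tree").fit(coords)
    return db.labels_
```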

October 13, 2023

Topics learnt today:
1: Creating a “Point” Object with the Geopy Library:

  • I’ve defined the latitude and longitude values for a specific location and created a Point object by passing these values to the Point() constructor, after which I accessed the latitude and longitude of the Point object through its latitude and longitude attributes.
  • The purpose of this is often to work with geospatial data in various data analysis or mapping tasks.
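
For reference, a tiny sketch of the Point usage described above; the coordinates are placeholders (roughly downtown Boston), not values from the project:

```python
from geopy.point import Point

# Hypothetical coordinates, for illustration only.
latitude, longitude = 42.3601, -71.0589

location = Point(latitude, longitude)          # build the Point object
print(location.latitude, location.longitude)   # access its attributes
```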

2: Geographical heat maps serve as a valuable tool for visualizing data related to fatal police shootings.

  • The histogram and the smooth histogram show where these incidents occur with greater frequency.
  • This visualization technique allows individuals to identify areas with a higher concentration of fatal police shootings, providing a clear and intuitive picture of the regions where the problem is most significant.
  • Geographical heat maps often use color gradients to represent the density of incidents, with darker colors indicating higher concentrations.

October 11, 2023

Topics Learnt Today:
1: Formula for calculating the distance between two latitude and longitude points: distance = geodesic(point1, point2).kilometers

  • This formula is valuable for establishing the foundation of clustering procedures, particularly when dealing with geographic data.
  • It leverages the geodesic function to compute the distance between two geographical coordinates in kilometers.
  • Additionally, it is worth noting that by replacing .kilometers with .miles, the distance can be calculated in miles, providing flexibility in unit selection for distance measurement.
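
A short sketch of the distance calculation, with placeholder coordinates (Boston and New York City) chosen only for illustration:

```python
from geopy.distance import geodesic

# Hypothetical coordinates: Boston and New York City.
point1 = (42.3601, -71.0589)
point2 = (40.7128, -74.0060)

distance_km = geodesic(point1, point2).kilometers  # distance in kilometers
distance_mi = geodesic(point1, point2).miles       # same distance in miles
print(f"{distance_km:.1f} km / {distance_mi:.1f} miles")
```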

2: An alternative statistical approach to comparing factors such as age and race is ANOVA test.

  • Instead of conducting 15 separate t-tests, which could be inefficient and cumbersome when examining multiple variables, we can use an Analysis of Variance (ANOVA) test.
  • ANOVA offers a more comprehensive and efficient way to assess the relationships between multiple factors simultaneously.
  • This approach can provide a holistic view of the impact of these factors on the data, avoiding the need for multiple separate tests and facilitating a better understanding of their combined effects.
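
A minimal sketch of such a one-way ANOVA with scipy, assuming the shootings DataFrame has the age and race columns referred to elsewhere in this journal:

```python
from scipy.stats import f_oneway

def age_anova(df):
    """One-way ANOVA comparing victim ages across the race categories."""
    groups = [g["age"].dropna().values for _, g in df.groupby("race")]
    f_stat, p_value = f_oneway(*groups)
    return f_stat, p_value
```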

October 6, 2023

Relationship between %Inactive, %Obese, %Diabetic using Pearson correlation:

Pearson Correlation Coefficient between % INACTIVE and % OBESE: 0.47265609987121526

  • The positive sign of the correlation coefficient (0.4727) indicates a positive linear relationship between the two variables, % INACTIVE and % OBESE. This means that as one variable increases, the other tends to increase as well, and vice versa.
  • The value of 0.4727 suggests a moderate positive correlation. It is not extremely close to 1, which would indicate a perfect positive linear relationship, nor is it close to 0, which would suggest no linear relationship. Instead, it falls in the range between 0 and 1, indicating a moderate strength of association.
  • This positive correlation suggests that regions or data points with higher percentages of physical inactivity (% INACTIVE) tend to also have higher percentages of obesity (% OBESE). Conversely, regions with lower physical inactivity percentages tend to have lower obesity percentages.
  • A Pearson Correlation Coefficient of approximately 0.4727 between % INACTIVE and % OBESE indicates a moderate positive linear relationship between physical inactivity and obesity.
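
A small sketch of the calculation, assuming a DataFrame whose columns are literally named "% INACTIVE" and "% OBESE" as in the output quoted above (the exact header spelling is an assumption):

```python
from scipy.stats import pearsonr

def inactivity_obesity_correlation(df):
    """Pearson correlation between % INACTIVE and % OBESE, with its p-value."""
    r, p = pearsonr(df["% INACTIVE"], df["% OBESE"])
    return r, p
```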

 

October 4, 2023

Mean comparison of inactivity and obesity:

Points observed:
1: Obesity rates are higher than physical inactivity rates.
2: The obesity rate has a more substantial impact on the dataset than the physical inactivity rate.

Histogram of %inactive and %obese

Points Observed:
1: Obesity and Inactivity have different trends but have some correlation between them.
2: Inactivity has a wider spread of the data, which means that it has higher variability.
3: Obesity has a comparatively smaller spread of the data, which means it has lower variability.

October 2, 2023

Inactivity dataset mathematical statistics:

From the inactivity dataset statistics we can observe that:
1: The median value is 16.7, indicating that approximately half of the data points fall below this value, and half fall above it.
2: The mean is slightly lower than the median (16.7), suggesting a slight negative skewness.
3: A standard deviation of 1.9253 indicates that the data values have moderate variability around the mean.
4: A negative skewness value (-0.3420) indicates that the data distribution is slightly left-skewed, with a tail extending to the left. This suggests that there may be a slight concentration of data points on the right side of the distribution, toward higher values.
5: A negative kurtosis value (-0.5490) suggests that the data distribution has lighter tails than a normal distribution (platykurtic). This indicates a lower likelihood of extreme values compared to a normal distribution.

The inactivity dataset statistics has a slight negative skewness (left-skewed), where the mean is slightly lower than the median. The standard deviation indicates moderate variability in the data, and the negative kurtosis suggests that the distribution has lighter tails compared to a normal distribution, indicating a lower likelihood of extreme values. The overall shape of the distribution is relatively close to a normal distribution.

September 29, 2023

Obesity dataset mathematical statistics:

From the statistics we can observe that:
1: The median value is 18.3, indicating that approximately half of the data points fall below this value, and half fall above it.

2: The mean is very close to the median (18.3), indicating that the data distribution is approximately symmetric.

3: A standard deviation of 1.0369 indicates that the data values are relatively close to the mean, with low variability.

4: A negative skewness value (-2.6851) indicates that the data distribution is strongly negatively skewed (left-skewed), with a tail extending to the left. This suggests that most of the data points are concentrated on the right side of the distribution, toward higher values.

5: A kurtosis value of 12.3225 suggests that the data distribution has very heavy tails compared to a normal distribution (very leptokurtic). This indicates a high likelihood of extreme values compared to a normal distribution.

The obesity dataset statistics describe a strong negative skewness (left-skewed) and very heavy tails. The mean is close to the median, suggesting approximate symmetry, but the strong negative skewness and high kurtosis indicate that the data distribution has a significant concentration of values on the right side with the potential for extreme values on the left side.

September 27, 2023

Diabetes dataset mathematical statistics:

From this we can observe that
1: The median value is 8.4, indicating that approximately half of the data points fall below this value, and half fall above it.
2: The mean is 8.7198, which is slightly higher than the median (8.4). This suggests that the data distribution may be positively skewed.
3: A standard deviation of 1.7946 indicates that the data values tend to be relatively close to the mean on average, but there is still some variability.
4: A positive skewness value (0.9744) indicates that the data distribution is positively skewed, with a tail extending to the right. This confirms the earlier observation that the mean is greater than the median.
5: A kurtosis value of 1.0317 suggests that the data distribution has slightly heavier tails than a normal distribution. This indicates a slightly higher likelihood of extreme values compared to a normal distribution.

The above statistics show that the dataset has a positive skewness (right-skewed) and slightly heavier tails than a normal distribution.

 

September 25, 2023

Topics Learnt Today:

1: Bootstrap: 

  • Bootstrap is a resampling technique used for statistical inference, such as estimating the sampling distribution of a statistic or constructing confidence intervals.
  • It involves repeatedly sampling from the dataset with replacement to create multiple bootstrap samples, each of the same size as the original dataset.
  • The statistic of interest (e.g., mean, median, standard error) is calculated for each bootstrap sample, and the distribution of these statistics is used to make inferences.
  • Bootstrap can also be applied to estimate prediction error by resampling the dataset and calculating the error metric (e.g., mean squared error) for each resampled dataset.
  • Advantages:
    • Provides an empirical estimate of the sampling distribution and can be used to construct confidence intervals.
    • Useful for making inferences about population parameters and assessing the stability of statistical estimates.
  • Disadvantages:
    • Does not directly provide model evaluation or performance estimation, unlike cross-validation.
    • May not be as straightforward for model selection or hyperparameter tuning compared to cross-validation.
  • In summary, the bootstrap is used for statistical inference, constructing confidence intervals, and estimating population parameters.
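
A compact sketch of a percentile-bootstrap confidence interval along the lines described above; the statistic, sample values, and number of resamples are all illustrative choices:

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement, same size as the original sample, n_boot times.
    boot_stats = np.array([
        stat(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
    return lower, upper

# Example: 95% CI for the mean of a small synthetic sample.
sample = np.array([8.1, 9.4, 7.8, 10.2, 8.9, 9.7, 8.4])
print(bootstrap_ci(sample))
```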

2: Training Error vs Test Error:

Training error and test error are two key concepts in machine learning and model evaluation. They provide insights into how well a machine learning model is performing during training and how it is likely to perform on unseen data. Understanding the differences between these two types of errors is essential for assessing a model’s generalization capability and identifying issues like overfitting or underfitting.

Training Error:

  • Training error, also known as in-sample error, is the error or loss that a machine learning model incurs on the same dataset that was used to train it.
  • When you train a model, it learns to fit the training data as closely as possible. The training error measures how well the model fits the training data.
  • A model that has a low training error is said to have a good fit to the training data. However, a low training error does not necessarily indicate that the model will generalize well to new, unseen data.
  • Training error tends to be overly optimistic because the model has already seen the training data and has adapted to it, potentially capturing noise and specific patterns that may not generalize to other datasets.

Test Error:

  • Test error, also known as out-of-sample error or validation error, is the error or loss that a machine learning model incurs on a dataset that it has never seen during training. This dataset is called the validation or test set.
  • The test error provides an estimate of how well the model is likely to perform on new, unseen data. It helps assess the model’s generalization capability.
  • A model with a low test error is expected to make accurate predictions on new data, indicating good generalization.
  • Test error is a more reliable indicator of a model’s performance on real-world data because it measures how well the model can generalize beyond the training data.
  • In summary, training error measures how well a model fits the training data, while test error provides an estimate of a model’s performance on new, unseen data.

3: Validation-set Approach: 

The validation-set approach is a technique used in machine learning and statistical model evaluation to assess a model’s performance and tune its hyperparameters. It’s particularly useful when you have a limited amount of data and want to estimate how well your model will generalize to new, unseen data. Here’s how the validation-set approach works:

  1. Data Splitting: The first step is to divide your dataset into three distinct subsets: a training set, a validation set, and a test set. The typical split ratios are 60-70% for training, 15-20% for validation, and 15-20% for testing, but these ratios can vary depending on the size of your dataset.
    • Training Set: This subset is used to train the machine learning model. The model learns patterns and relationships in the data from this set.
    • Validation Set: The validation set is used for hyperparameter tuning and model selection. It serves as an independent dataset to evaluate the model’s performance under various hyperparameter settings.
    • Test Set: The test set is a completely independent dataset that the model has never seen during training or hyperparameter tuning. It is used to provide an unbiased estimate of the model’s generalization performance.
  2. Model Training and Hyperparameter Tuning: With the training set, you train the machine learning model using various hyperparameter settings. The goal is to find the set of hyperparameters that yields the best performance on the validation set. This process often involves iteratively adjusting hyperparameters and evaluating the model on the validation set until satisfactory performance is achieved.
  3. Model Evaluation: After hyperparameter tuning is complete, you have a final model with the best hyperparameters. You then evaluate this model’s performance on the test set. The test set provides an unbiased estimate of how well the model is likely to perform on new, unseen data.
  4. Performance Metrics: You can use various evaluation metrics, depending on the type of problem you’re addressing. Common metrics include accuracy, precision, recall, F1 score for classification problems, and mean squared error (MSE), root mean squared error (RMSE), or R-squared for regression problems.
  5. Iterative Process: It’s important to note that the validation-set approach can involve an iterative process of model training, hyperparameter tuning, and evaluation. This process helps ensure that the model is well-tuned and performs optimally on unseen data.
  6. Caution: While the validation-set approach is a valuable technique for model evaluation and hyperparameter tuning, it’s essential to avoid data leakage. Data leakage occurs when information from the validation set or test set unintentionally influences the model training process. Ensure that you use the validation set only for tuning hyperparameters and the test set only for final evaluation.
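
A hedged sketch of the three-way split described in step 1, using scikit-learn’s train_test_split twice; the 60/20/20 proportions and the helper name are assumptions:

```python
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_size=0.2, test_size=0.2, seed=42):
    """Split data into train / validation / test sets (roughly 60/20/20)."""
    # First carve off the test set, then split the remainder into train and validation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    val_fraction = val_size / (1.0 - test_size)   # fraction of the remainder
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_fraction, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test
```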

September 22, 2023

Topics Learnt Today:

1: Polynomial Regression: Polynomial regression is a type of regression analysis used when the relationship between the independent variable(s) and the dependent variable is not linear but can be approximated by a polynomial function.

  • Polynomial regression allows for modeling non-linear relationships between variables by introducing higher-order terms (e.g., X², X³) into the regression equation.
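
A minimal polynomial-regression sketch using a scikit-learn pipeline on synthetic data; the cubic degree and the data-generating function are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy cubic relationship between x and y.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(0, 1, 100)

# PolynomialFeatures expands x into [x, x^2, x^3]; LinearRegression fits the coefficients.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[1.5]]))
```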

2: Logistic Regression: 

  • Logistic regression is a type of regression analysis used for predicting binary or categorical outcomes. It’s used when the dependent variable is binary (e.g., 0/1, Yes/No, True/False).
  • The logistic regression model uses the logistic (sigmoid) function to map the linear combination of independent variables to the probability of belonging to one of the categories.

3: Step Function: 

  • A step function, also known as a Heaviside step function or a unit step function, is a mathematical function that returns a constant value (usually 0 or 1) depending on whether its argument is greater than or equal to a threshold value.
  • It’s often used in engineering and physics to model discontinuous changes or events. In binary classification problems, it’s sometimes used to represent binary outcomes.

4: State Function:

  • A state function, in the context of state-space models, represents the internal state of a dynamic system. State-space models are commonly used in control theory, engineering, and various scientific fields.
  • State-space models describe a system using two equations: the state equation (describing how the internal state evolves over time) and the measurement equation (relating the internal state to observed measurements).
  • In control theory, state functions are used to represent variables such as position, velocity, and acceleration, and they are essential for designing control systems.

September 20, 2023

Topics Learnt Today:
1: Predictive Model: 

A predictive model is a mathematical or computational representation of a real-world system or phenomenon that is used to make predictions or forecasts about future events or outcomes based on historical data and patterns. Predictive models are a fundamental component of machine learning, data analysis, and statistics, and they find applications in various fields, including finance, healthcare, marketing, and more.

Here are some key aspects and components of predictive models:

  1. Data Collection: Predictive models require historical data to learn from. This data typically includes information about the system being modeled and the outcomes of interest. Data collection can involve various sources, such as sensors, databases, surveys, or web scraping.
  2. Features: Features, also known as predictors or independent variables, are the variables or attributes from the data that the model uses to make predictions. Feature selection and engineering are critical steps in model development to choose the most relevant and informative features.
  3. Target Variable: The target variable, also known as the dependent variable, is the variable the model aims to predict. It represents the outcome or event of interest. For example, in a credit scoring model, the target variable might be whether a person will default on a loan or not.
  4. Model Selection: Choosing an appropriate predictive model is a crucial step. The choice of model depends on the nature of the data (e.g., regression for continuous outcomes, classification for categorical outcomes) and the specific problem being addressed. Common models include linear regression, decision trees, random forests, support vector machines, and neural networks, among others.
  5. Training: Training a predictive model involves using historical data to teach the model how to make predictions. During training, the model learns the relationships between the features and the target variable. The goal is to minimize prediction errors on the training data.
  6. Validation and Testing: After training, the model’s performance is evaluated using validation and testing datasets. Validation helps tune hyperparameters and assess model performance during development, while testing provides an estimate of how well the model will perform on new, unseen data.
  7. Evaluation Metrics: Various evaluation metrics are used to assess the quality of predictions made by the model. Common metrics include accuracy, precision, recall, F1 score, mean squared error (MSE), and root mean squared error (RMSE), depending on the type of problem (classification or regression).
  8. Deployment: Once a predictive model has been trained and tested, it can be deployed in a real-world application. Deployment involves integrating the model into a software system or process to make automated predictions on new data.
  9. Monitoring and Maintenance: Predictive models may require ongoing monitoring and maintenance to ensure they continue to provide accurate predictions. Data drift, changes in the distribution of data, and shifts in the underlying relationships can impact a model’s performance over time.
  10. Retraining: Periodic retraining of the model with updated data is often necessary to maintain its predictive accuracy. Models can become stale if not regularly refreshed with new information.

 

2: Chi Square Regression: Chi-square regression, also known as Poisson regression or log-linear regression, is a statistical regression model used for analyzing count data or frequency data, where the dependent variable represents counts or occurrences of an event in a fixed unit of observation. This type of regression is particularly suitable when the assumptions of linear regression, such as normally distributed residuals, are not met, and the data exhibit a Poisson or count distribution.

Applications of chi-square regression include analyzing data from fields such as epidemiology (e.g., disease incidence), social sciences (e.g., survey responses), and manufacturing (e.g., defect counts). It is especially useful when dealing with data that exhibit a count distribution, and it provides a way to model and interpret relationships between predictors and counts while accounting for the inherent nature of the data.
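
As a sketch of the Poisson/log-linear model this entry refers to, statsmodels’ GLM with a Poisson family can be fit to count data; the data below is synthetic and the coefficients are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic count data: counts whose expected rate grows with x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 200)
counts = rng.poisson(np.exp(0.3 + 0.8 * x))

# A GLM with a Poisson family and log link models log(E[count]) = b0 + b1 * x.
X = sm.add_constant(x)
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.summary())
```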

September 18, 2023

Topics Learnt today:

Overfitting is a common problem in machine learning and statistical modeling. It occurs when a model learns the training data too well, capturing noise or random fluctuations in the data rather than the underlying patterns or relationships. As a result, an overfit model performs very well on the training data but poorly on unseen or new data, which is the ultimate goal of any predictive model. Overfitting is typically a symptom of high model variance: the model is too complex for the given data.

Key characteristics and causes of overfitting:

  1. Complex Models: Models that are excessively complex, with too many parameters or features, are prone to overfitting. They have the capacity to fit noise in the data, which leads to poor generalization.
  2. Small Dataset: Overfitting is more likely to occur when you have a small dataset because there is not enough data to capture the true underlying patterns. With limited data, the model may fit the noise instead.
  3. High Model Flexibility: Models with high flexibility or capacity, such as deep neural networks or decision trees with many branches, are susceptible to overfitting. They can adapt too closely to the training data.
  4. Lack of Regularization: Regularization techniques like L1 and L2 regularization or dropout in neural networks are used to control overfitting by adding constraints to the model’s parameters. If these techniques are not used when necessary, overfitting can occur.
  5. Noise in Data: If the training data contains noise or errors, the model might try to fit that noise, leading to overfitting. Clean and well-preprocessed data is important to reduce the risk of overfitting.
  6. Feature Engineering: Including too many irrelevant or redundant features in the model can contribute to overfitting. Feature selection or dimensionality reduction techniques can help mitigate this issue.
  7. Early Stopping: In the training of iterative models (e.g., neural networks), if you train for too many epochs, the model may start overfitting. Early stopping, which involves monitoring the model’s performance on a validation set and stopping training when performance starts to degrade, can help prevent this.
  8. Cross-Validation: Not using cross-validation to assess model performance can lead to overfitting. Cross-validation helps in estimating how well a model will generalize to unseen data.
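As a small illustration of points 1 and 4 (synthetic data, not the project data), the sketch below fits an overly flexible polynomial model with and without an L2 penalty and compares training error to error on unseen inputs:

```python
# Sketch: an over-parameterized polynomial fit versus an L2-regularized
# (Ridge) fit on the same synthetic data. All values here are made up.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 20)   # signal + noise
X_new = np.linspace(0, 1, 200).reshape(-1, 1)                # unseen inputs
y_new = np.sin(2 * np.pi * X_new).ravel()

# Degree-15 polynomial: far more capacity than 20 noisy points justify.
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
# Same capacity, but the L2 penalty shrinks the coefficients.
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3)).fit(X, y)

for name, model in [("unregularized", overfit), ("ridge", ridge)]:
    print(name,
          "train MSE:", round(mean_squared_error(y, model.predict(X)), 4),
          "unseen MSE:", round(mean_squared_error(y_new, model.predict(X_new)), 4))
```

A large gap between the training error and the error on unseen inputs is the typical signature of overfitting; the penalized fit usually narrows that gap.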

Cross-validation is a technique used in machine learning and statistics to assess the performance and generalization ability of a predictive model. Its primary purpose is to estimate how well a model will perform on unseen data, which helps in avoiding overfitting (a model that fits the training data too closely but performs poorly on new data) and provides a more accurate evaluation of a model’s capabilities.

Here’s how cross-validation works:

  1. Data Splitting: The first step is to divide the available dataset into two or more subsets: typically, a training set and a testing (or validation) set. The training set is used to train the machine learning model, while the testing set is used to evaluate its performance.
  2. K-Fold Cross-Validation: The most common form of cross-validation is k-fold cross-validation, where the dataset is divided into ‘k’ subsets of approximately equal size. The model is trained and evaluated ‘k’ times, using a different subset as the validation set in each iteration. For example, in 5-fold cross-validation, the dataset is divided into 5 subsets, and the model is trained and tested five times, with each subset serving as the validation set once.
  3. Performance Metrics: In each fold or iteration, the model’s performance is measured using evaluation metrics such as accuracy, precision, recall, F1 score, or mean squared error, depending on the type of problem (classification or regression) you’re solving.
  4. Average Performance: After all k iterations are complete, the performance metrics are averaged across the k folds to obtain a single evaluation score. This score provides an estimate of how well the model is likely to perform on new, unseen data.

Advantages of Cross-Validation:

  1. Robustness: It provides a more robust estimate of a model’s performance because it uses multiple validation sets rather than just one.
  2. Avoiding Overfitting: Cross-validation helps in detecting overfitting because the model is evaluated on different data subsets. If the model performs well across all folds, it’s more likely to generalize well to new data.
  3. Optimal Parameter Tuning: Cross-validation is often used for hyperparameter tuning, allowing you to choose the best set of hyperparameters for your model.
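As a minimal sketch of the k-fold procedure described above (synthetic data, with plain linear regression standing in as a hypothetical model choice):

```python
# Minimal 5-fold cross-validation sketch with scikit-learn.
# The feature matrix and target below are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                                  # 100 rows, 3 predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, 100)   # linear signal + noise

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Each fold trains on roughly 80% of the rows and scores on the held-out 20%.
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("per-fold MSE:", -scores)
print("average MSE: ", -scores.mean())
```

scikit-learn reports the score as a negative MSE so that higher is always better; negating it recovers the usual mean squared error, and the average across folds is the single evaluation score mentioned in step 4.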

 

September 15, 2023

Topics Learnt Today:

Multiple Linear Regression

Multiple linear regression (MLR) is a statistical technique that models the relationship between a dependent variable and two or more independent variables, often called predictors or explanatory variables. It extends simple linear regression, which considers only one independent variable, and aims to quantify how each predictor is related to the dependent variable.

1: Coefficient Interpretation: The coefficients (β1, β2, β3, and so on) show how strongly, and in which direction, each independent variable is related to the dependent variable. For instance, if β1 is positive, it implies that, holding all other variables constant, an increase in X1 is associated with an increase in Y.

2: Intercept: The intercept (β0) is the estimated value of the dependent variable when all independent variables are zero. Depending on the context of the data, this value might not always have a meaningful interpretation.

3: Assumptions: Multiple linear regression assumes homoscedasticity, normality, and independence of the residuals (the differences between the actual values of Y and the values predicted by the model). Violations of these assumptions may affect the reliability of the regression results.

4: Model Evaluation: A variety of statistical approaches, including hypothesis testing, R-squared, and adjusted R-squared, can be used to assess the goodness of fit of the model and determine whether it adequately explains the variability in the dependent variable.

5: Multicollinearity: This occurs when two or more independent variables in the model are strongly correlated with each other, which makes it difficult to separate the unique contribution of each variable. Multicollinearity can be detected and addressed using techniques such as the variance inflation factor (VIF); a sketch applying it follows below.
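To tie these points together, here is a minimal sketch (synthetic data, hypothetical variable names x1 and x2) that fits a multiple linear regression with statsmodels and checks the VIF of each predictor:

```python
# Minimal multiple linear regression sketch with a VIF check.
# x1 and x2 are hypothetical, deliberately correlated predictors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 80
df = pd.DataFrame({"x1": rng.normal(size=n)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.6, size=n)          # correlated with x1
df["y"] = 3 + 2 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.5, size=n)

X = sm.add_constant(df[["x1", "x2"]])
model = sm.OLS(df["y"], X).fit()
print(model.summary())   # coefficients, R-squared, adjusted R-squared, p-values

# VIF for each predictor; values above roughly 5-10 are a common warning sign.
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, "VIF:", variance_inflation_factor(X.values, i))
```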

September 13, 2023

1: p-value – The p-value is a statistical measure that is commonly used in hypothesis testing to assess the strength of evidence against a null hypothesis.

  • Small p-value (typically ≤ α): Strong evidence against the null hypothesis. Researchers may conclude that there is a significant effect or difference, supporting the alternative hypothesis.
  • Large p-value (typically > α): Weak evidence against the null hypothesis. Researchers do not have enough evidence to reject the null hypothesis.

2: Breusch-Pagan test – The significance of the p-value in this test is as follows:

  • If p-value ≤ α: There is strong evidence to reject the null hypothesis. In other words, you conclude that there is heteroscedasticity in the regression model, suggesting that the variance of the error term is not constant across the levels of the independent variables.
  • If p-value > α: There is not enough evidence to reject the null hypothesis. In this case, you would conclude that there is no significant heteroscedasticity in the regression model, and it is reasonable to assume that the variance of the error term is constant across the levels of the independent variables.
  • In summary, the p-value in the Breusch-Pagan test helps you assess whether there is heteroscedasticity in your regression model. If the p-value is low (typically less than 0.05), you conclude that there is evidence of heteroscedasticity, which can have implications for the validity of your regression analysis. If the p-value is high, you do not have strong evidence of heteroscedasticity and can proceed with more confidence in the assumption of homoscedasticity.

3: Chi-Square Distribution – As the degrees of freedom increase, the chi-square distribution becomes more bell-shaped and approaches a normal distribution (the central limit theorem applies). Chi-square distributions are an essential tool in statistical analysis, particularly for drawing inferences about population variances, testing hypotheses, and assessing relationships between categorical variables.
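A minimal sketch of running the Breusch-Pagan test with statsmodels (synthetic data in which the error variance deliberately grows with the predictor, so the test should flag heteroscedasticity):

```python
# Minimal Breusch-Pagan sketch: fit OLS, then test the residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 3 + 2 * x + rng.normal(0, x)          # noise scale grows with x

X = sm.add_constant(x)
residuals = sm.OLS(y, X).fit().resid

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, X)
alpha = 0.05
print("LM statistic:", lm_stat)
print("LM p-value:  ", lm_pvalue)
print("reject homoscedasticity?", lm_pvalue <= alpha)
```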

September 11, 2023

Topics learnt in today’s class: (Simple Linear Regression)

1: Skewness – In simple linear regression, skewness describes the asymmetry of the distribution of the residuals, which can affect the reliability of the regression model and how its findings should be interpreted. To verify the accuracy of the regression analysis, it is important to check the residuals for skewness and, if necessary, take appropriate remedial action.

2: Kurtosis – In simple linear regression, kurtosis describes how the residuals are distributed and indicates whether their tails are heavier or lighter than those of a normal distribution. It is important to consider kurtosis when analysing regression data, since severe kurtosis can affect the validity of the regression results and may call for corrective action. For a normal distribution the kurtosis is 3, but in the diabetes dataset the residuals have a kurtosis of about 4, so their distribution is not exactly normal.

3: Heteroscedasticity – Heteroscedasticity occurs when the variance of the residuals (the differences between the observed values of the dependent variable and the values predicted by the regression model) is not constant. In other words, the spread of the residuals changes as you move along the values of the independent variable. Heteroscedasticity can lead to incorrect inferences about the statistical significance of the regression coefficients: standard errors may be under- or over-estimated, which affects the accuracy of the parameter estimates and can produce misleading assessments of the significance of predictors. When heteroscedasticity is present, least squares estimates may no longer be the most efficient estimators of the regression coefficients, and this inefficiency can reduce the statistical power of the analysis.
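A minimal sketch for checking the skewness and kurtosis of regression residuals with scipy (synthetic data, not the diabetes dataset):

```python
# Minimal residual-diagnostics sketch: skewness and kurtosis of OLS residuals.
import numpy as np
import statsmodels.api as sm
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 150)
y = 1.5 * x + rng.normal(0, 1, 150)

X = sm.add_constant(x)
residuals = sm.OLS(y, X).fit().resid

print("skewness:", skew(residuals))                    # near 0 for symmetric residuals
print("kurtosis:", kurtosis(residuals, fisher=False))  # near 3 for a normal distribution
```

Values far from 0 (skewness) or 3 (kurtosis, as noted above) suggest that the residuals depart from normality and that remedial action, such as transforming the variables, may be worth considering.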