December 8, 2023
Linear Regression:
Model Evaluation of Train Data:
Scatter Plot of Predicted vs Actual Values:
• The scatter plot places the actual values (y_train) on the x-axis and the predicted values (y_pred) on the y-axis.
• The closer the points are to the diagonal line (not drawn, but implied), the better the model’s predictions match the actual data.
• The points align well along an increasing diagonal, suggesting a good fit between the model’s predictions and the actual values, especially as the total number of jobs increases.
Distribution of Residuals:
• The second image is a histogram overlaid with a kernel density estimate that shows the distribution of the model’s residuals, which are the differences between the actual values and the predicted values.
• Ideally, the residuals should be normally distributed around zero, indicating that the model’s predictions are unbiased.
• The distribution looks approximately normal and centered around zero, which is a good sign, although there seems to be a slight right skew.
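The two diagnostic plots described above can be reproduced along the following lines. This is only a minimal sketch: the data is synthetic, the straight-line fit with `np.polyfit` is a stand-in for whatever model was actually trained, and the helper name `plot_diagnostics` is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (hypothetical): one feature, one target.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=200)
y_train = 3.0 * x_train + 5.0 + rng.normal(0, 1.0, size=200)

# Ordinary least squares fit of a straight line (stand-in for the real model).
slope, intercept = np.polyfit(x_train, y_train, deg=1)
y_pred = slope * x_train + intercept
residuals = y_train - y_pred  # actual minus predicted


def plot_diagnostics(y_true, y_hat):
    """Return a figure with a predicted-vs-actual panel and a residual histogram."""
    fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

    # Panel 1: actual on the x-axis, predicted on the y-axis.
    ax_scatter.scatter(y_true, y_hat, s=10, alpha=0.6)
    lo = min(y_true.min(), y_hat.min())
    hi = max(y_true.max(), y_hat.max())
    ax_scatter.plot([lo, hi], [lo, hi], color="red", linestyle="--")  # y = x reference
    ax_scatter.set_xlabel("Actual")
    ax_scatter.set_ylabel("Predicted")

    # Panel 2: histogram of residuals with a marker at zero.
    ax_hist.hist(y_true - y_hat, bins=30)
    ax_hist.axvline(0.0, color="red", linestyle="--")
    ax_hist.set_xlabel("Residual (actual - predicted)")

    fig.tight_layout()
    return fig


fig = plot_diagnostics(y_train, y_pred)
```

Drawing the y = x reference line explicitly makes the "closeness to the diagonal" easier to judge by eye. The sketch leaves out the kernel density overlay; seaborn's `histplot` with `kde=True` is one way to add it.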
Model Evaluation of Validation Data:
Scatter Plot of Predicted vs Actual Values:
This plot compares the actual values (y_val) on the x-axis with the predicted values on the y-axis. If the predictions were perfect, every point would lie on the diagonal line where the predicted value equals the actual value. The scatter shows that the model’s predictions are reasonably close to the actual values, although there is some variance, especially in the middle range of the actual values.
Distribution of Residuals:
The residuals are the differences between the actual and predicted values. This histogram shows the distribution of these residuals, with a superimposed kernel density estimate (KDE). The residuals seem to be approximately normally distributed, with a mean close to zero. This is a good sign, indicating that the model does not systematically overpredict or underpredict the total number of jobs. However, there is a noticeable spread, suggesting that there are predictions that are significantly off from the actual values, which is also reflected in the scatter plot.
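The visual reading of the residual histogram can be backed up with a few summary numbers. The sketch below assumes the validation targets and predictions are available as arrays; the helper name `residual_summary` and the sample values are made up for illustration.

```python
import numpy as np


def residual_summary(y_true, y_pred):
    """Summarize residuals (actual minus predicted) with a few scalar statistics."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mean = residuals.mean()
    std = residuals.std()
    # Sample skewness: third central moment divided by std cubed.
    skew = ((residuals - mean) ** 3).mean() / std**3 if std > 0 else 0.0
    rmse = np.sqrt((residuals**2).mean())
    return {
        "mean": float(mean),  # near zero: no systematic over- or underprediction
        "std": float(std),    # spread of the errors
        "skew": float(skew),  # positive values indicate a right skew
        "rmse": float(rmse),  # typical size of an error
    }


# Hypothetical validation values, for illustration only.
y_val = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
y_val_pred = np.array([11.0, 11.0, 16.0, 17.0, 20.0])
summary = residual_summary(y_val, y_val_pred)
```

A mean close to zero supports the "no systematic bias" reading, while the standard deviation and RMSE quantify the spread that the histogram shows.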
December 6, 2023
The pairplot shows the relationships and distributions of different economic indicators within clusters determined by KMeans clustering. Each scatter plot’s axes represent two indicators, while the density plots on the diagonal show the distribution of single variables, colored by cluster.
Analysis:
• Clusters are formed based on the inherent groupings in the multidimensional data. Each cluster represents a group of data points that are similar to one another.
• The scatter plots demonstrate how these clusters are distributed with respect to two indicators at a time.
• The distribution plots (on the diagonal) indicate the range and density of each variable within each cluster, with some variables showing distinct peaks for different clusters.
From the clusters, we can infer that certain combinations of economic indicators commonly occur together, suggesting potential correlations or influences between these indicators.
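A cluster assignment like the one used for the pairplot could be produced roughly as follows. This is a sketch under assumptions: the data is synthetic, the column names are placeholders rather than the real indicator names, and the number of clusters (3) and the scaling step are choices made here, not taken from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the economic indicators: three well-separated groups.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [20.0, 0.0, 10.0]])
data = np.vstack([c + rng.normal(0, 1.0, size=(50, 3)) for c in centers])
df = pd.DataFrame(data, columns=["indicator_a", "indicator_b", "indicator_c"])

# Standardize so that no single indicator dominates the distance computation.
scaled = StandardScaler().fit_transform(df)

# Fit KMeans and attach the cluster label to each row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(scaled)
```

Passing the resulting frame to seaborn's `pairplot` with `hue="cluster"` would give a grid of the kind described above, with pairwise scatter plots off the diagonal and per-cluster distributions on it.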
December 4, 2023
Total Jobs vs Logan Passengers:
There is a positive relationship between the number of passengers at Logan Airport and the total number of jobs.
The R-Squared value is approximately 0.729, suggesting that about 72.9% of the variability in total jobs can be explained by the number of Logan passengers.
The p-value is extremely low (approximately 3.57×10⁻¹⁵), indicating a statistically significant relationship.
Total Jobs vs Logan International Flights:
Similarly, the number of international flights has a positive correlation with the total number of jobs.
The R-Squared value is 0.764, meaning that approximately 76.4% of the variability in total jobs is accounted for by the number of international flights.
The p-value is very small (around 3.04×10⁻¹⁷), which implies a statistically significant relationship.
Total Jobs vs Hotel Occupancy Rate:
The relationship between hotel occupancy rates and total jobs is weaker compared to the previous two variables.
The R-Squared value is about 0.142, indicating that only 14.2% of the variability in total jobs is explained by the hotel occupancy rate.
The p-value is approximately 0.197, which is above the typical significance level of 0.05, suggesting that the relationship might not be statistically significant.
Total Jobs vs Hotel Average Daily Rate:
There is a moderate positive relationship between the average daily rate of hotels and total jobs.
The R-Squared value is 0.313, which means that about 31.3% of the variability in total jobs can be explained by the hotel average daily rate.
The p-value is approximately 0.0038, indicating a statistically significant relationship at common significance levels.
Total Jobs vs Unemployment Rate:
There is a strong negative relationship between the unemployment rate and the total number of jobs, which is intuitive as higher unemployment would typically be associated with fewer jobs.
The R-Squared value is about 0.872, suggesting that 87.2% of the variability in total jobs can be explained by the unemployment rate.
The p-value is extremely low (around 4.10×10⁻²⁷), indicating a very strong statistically significant relationship.
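R-Squared and p-values of the kind reported in this entry can be obtained from a simple one-predictor regression. The sketch below uses `scipy.stats.linregress` on synthetic data; the sample size, the variable names, and the data itself are assumptions for illustration, and the original analysis may have used a different tool.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one predictor (x) and total jobs (y).
rng = np.random.default_rng(1)
n = 84  # hypothetical number of monthly observations
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)

# Simple linear regression of y on x.
result = stats.linregress(x, y)

# With a single predictor, R-squared is the squared correlation coefficient.
r_squared = result.rvalue**2
# Two-sided p-value for the null hypothesis that the slope is zero.
p_value = result.pvalue

print(f"slope={result.slope:.3f}, R^2={r_squared:.3f}, p={p_value:.3g}")
```

The sign of `result.slope` gives the direction of the relationship (positive for the passenger and flight counts, negative for the unemployment rate), while `r_squared` and `p_value` correspond to the two figures quoted for each variable.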
December 1, 2023
Box Plots:
- Hotel Occupancy Rate: The median hotel occupancy rate in the Boston area is between 72.5% and 77.5%. The spread of the data is smaller than that of the number of passengers and international flights, with most months having an occupancy rate between 65% and 85%.
- Hotel Avg Daily Rate: The median hotel average daily rate in the Boston area is between $240 and $265 per night. The spread of the data is larger than that of the hotel occupancy rate, with some months averaging as low as $200 and others as high as $300.
- Hotel occupancy rates are also relatively stable, but there is a wider range of possible rates.
- Hotel average daily rates vary more significantly than any of the other variables.
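The quantities a box plot displays (median, quartiles, whisker ends) can also be computed directly, which makes statements about spread easier to compare across variables. The sketch below is illustrative only: the helper name `box_stats` is hypothetical, the occupancy values are made up, and the whiskers follow the common 1.5 × IQR convention.

```python
import numpy as np


def box_stats(values):
    """Compute the summary statistics that a standard box plot displays."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers extend to the most extreme points within 1.5 * IQR of the box.
    lower_whisker = v[v >= q1 - 1.5 * iqr].min()
    upper_whisker = v[v <= q3 + 1.5 * iqr].max()
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_whisker": float(lower_whisker),
        "upper_whisker": float(upper_whisker),
    }


# Hypothetical monthly occupancy rates, for illustration only.
occupancy = [0.62, 0.68, 0.71, 0.74, 0.75, 0.77, 0.80, 0.83, 0.85]
occ = box_stats(occupancy)
```

Comparing `iqr` relative to `median` across variables gives a scale-free way to judge which one varies more, since occupancy rates and dollar amounts are measured in different units.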