October 30, 2023

Topics Learnt Today:

  • White Mean Age: The White group’s mean age is roughly 40.09 years old.
  • Black Mean Age: The Black population is roughly 32.74 years old on average.
  • Mean Difference: The absolute difference between the two groups’ mean ages is 7.35 years (Mean Age White – Mean Age Black).
  • T-Statistic: The t-statistic expresses the observed mean difference as a number of standard errors away from zero. Here it is 18.46, suggesting a highly significant difference in the mean ages of the two groups.
  • P-Value: This is the probability of obtaining a t-statistic at least as extreme as the one observed, assuming there is no difference between the groups (the null hypothesis). The very low p-value of 1.797e-72, or 1.797 x 10^(-72), provides strong evidence against the null hypothesis. Stated differently, it is extremely unlikely that the observed difference in mean ages is the result of pure chance.
  • More or Equal Simulated Mean Differences: This relates to simulation-based (permutation) testing, in which researchers generate a range of test statistics under the assumption that the null hypothesis is true. Since this count is 0, none of the simulated mean differences equaled or exceeded the observed mean difference.
  • Total Simulations: The total number of simulations run, in this example 2,000,000.

In conclusion, the low p-value and large t-statistic provide strong evidence to reject the null hypothesis that there is no difference in the mean ages between the White and Black groups. The data indicate a statistically significant difference in the two groups’ mean ages, with the White group’s mean age being greater than the Black group’s.
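
As a quick sketch of how this workflow could be run in Python with scipy — the ages below are randomly generated placeholders rather than the actual dataset, and the simulation count is reduced from the 2,000,000 used above:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    white_ages = rng.normal(40.09, 13, 2000)  # placeholder samples,
    black_ages = rng.normal(32.74, 11, 1200)  # not the real dataset

    # Welch two-sample t-test for the difference in mean ages
    t_stat, p_value = stats.ttest_ind(white_ages, black_ages, equal_var=False)
    print(t_stat, p_value)

    # Permutation simulation: count shuffled mean differences at least
    # as extreme as the observed one (reduced from 2,000,000 for speed)
    observed = white_ages.mean() - black_ages.mean()
    pooled = np.concatenate([white_ages, black_ages])
    n_sims, count = 10_000, 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        diff = pooled[:len(white_ages)].mean() - pooled[len(white_ages):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    print(count, "of", n_sims, "simulated differences were as extreme")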

October 27, 2023

Topics Learnt Today:

1: One-Tailed Test:

  • For the scenario: the average age in State A is higher than the national average age.
  • Null Hypothesis (H0): State A’s average age is either the same as or less than the average age of the country.
  • Alternative Hypothesis (H1): State A’s average age is significantly higher than the average age of the country.
  • This one-tailed (right-tailed) test checks whether State A’s average age is statistically higher than the national average.

2: Two-Tailed Test:

  • For the scenario: Number of fatal police shootings during different months of the year in a city within the US.
  • Null Hypothesis (H0): The number of shooting occurrences in any given month is equal to the monthly average for the year.
  • Alternative Hypothesis (H1): The number of shooting occurrences in at least one month differs significantly from that average.
  • This two-tailed test detects a substantial deviation in either direction, whether the monthly shooting count is higher or lower than the average.
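
A short sketch of how both tests could be run with scipy’s ttest_1samp; every number below is a made-up placeholder:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # One-tailed (right-tailed): H1 is that State A's mean age is HIGHER
    # than the national average (38.5 is a placeholder figure)
    state_a_ages = rng.normal(39.5, 12, 500)
    t1, p1 = stats.ttest_1samp(state_a_ages, 38.5, alternative="greater")

    # Two-tailed: H1 is that a month's shooting count differs from the
    # monthly average in EITHER direction (counts are placeholders)
    january_counts = rng.poisson(85, 10)
    t2, p2 = stats.ttest_1samp(january_counts, 80.0, alternative="two-sided")
    print(p1, p2)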

October 23, 2023

Topics Learnt Today:

K-medoids:

  • Strengths:
    K-medoids, also known as PAM (Partitioning Around Medoids), is a more robust alternative to K-means. It identifies clusters based on representative points called medoids. These medoids are actual data points within the dataset, making them more suitable for handling irregularly shaped clusters.
    Unlike K-means, K-medoids is less sensitive to outliers, making it a better choice when dealing with data containing noise or extreme values.
  • Weaknesses:
    While K-medoids is more versatile than K-means in terms of cluster shape, it can still struggle with very large datasets due to its computational complexity.
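
A minimal K-medoids sketch, assuming the third-party scikit-learn-extra package is installed (its KMedoids class provides a K-medoids implementation); the data and parameters are illustrative:

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    rng = np.random.default_rng(2)
    points = np.vstack([rng.normal(0, 1, (50, 2)),
                        rng.normal(5, 1, (50, 2))])  # two synthetic blobs

    kmedoids = KMedoids(n_clusters=2, metric="euclidean",
                        random_state=0).fit(points)
    print(kmedoids.cluster_centers_)  # medoids are actual data points
    print(kmedoids.labels_[:10])      # cluster assignment per point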

October 20, 2023

Topics Learnt Today:

1: K-Means:

  • Strengths:
    K-means excels at identifying clusters of data points that are close to each other spatially. It operates by partitioning data into a predetermined number of clusters, with each cluster having a centroid point. The algorithm groups data points based on their proximity to these centroids. It is particularly effective when dealing with clusters of relatively uniform size and shape, making it suitable for situations where clusters are spherical or have similar geometries.
  • Weaknesses:
    K-means struggles with irregularly shaped clusters, as it assumes that clusters are spherical and uniform in size. When clusters are elongated or have varying sizes, K-means may produce suboptimal results, leading to the mixing of data points between clusters.
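
A minimal K-means sketch with scikit-learn on synthetic blob data, the roughly spherical case where the algorithm shines:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Three roughly spherical, similarly sized clusters
    points, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print(kmeans.cluster_centers_)  # one centroid per cluster
    print(kmeans.labels_[:10])      # cluster assignment for each point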

2: DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

  • Strengths:
    DBSCAN excels at finding clusters of varying shapes and sizes. It identifies clusters based on the density of data points, allowing it to uncover clusters of different geometries.
    DBSCAN is particularly adept at identifying outliers within the dataset, as it doesn’t force data points into clusters if they don’t meet density criteria.
  • Weaknesses:
    DBSCAN may require careful parameter tuning, such as the radius of the neighborhood around each point, to yield optimal results. In some cases, inappropriate parameter choices can lead to under-segmentation or over-segmentation of the data.
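
And a matching DBSCAN sketch; eps (the neighborhood radius mentioned above) and min_samples are illustrative choices:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two crescent-shaped clusters that centroid-based methods handle poorly
    points, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    db = DBSCAN(eps=0.2, min_samples=5).fit(points)
    print(np.unique(db.labels_))  # label -1 marks noise points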

October 18, 2023

Topics Learnt Today:

1: Exploratory Data Analysis:

  • The dataset, from The Washington Post data repository, focuses on fatal police shootings in the United States.
  • This exploration involved an in-depth analysis of the dataset’s structure, its columns, and the types of data they contained. The primary goal was to gain a clear understanding of how the dataset was organized.
  • I identified missing data and outliers within the dataset and addressed them so that the analysis would be accurate and reliable.

2: Data Cleaning:

  • In the data cleaning process, I focused mainly on the variables age, race, and flee. I imputed missing values in these variables where suitable, ensuring that incomplete records were handled appropriately.
  • I also dealt with outliers to prevent extreme values from negatively impacting the results of the analysis. By addressing missing data and handling outliers, the dataset’s quality is enhanced, making it more suitable for accurate and reliable analysis.
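
A sketch of one possible cleaning approach with pandas; the file name is a placeholder for a local copy of the dataset, and median/mode imputation plus an IQR clip are just one reasonable set of choices:

    import pandas as pd

    df = pd.read_csv("fatal-police-shootings-data.csv")  # placeholder path

    # Impute missing values in the three variables of interest
    df["age"] = df["age"].fillna(df["age"].median())
    df["race"] = df["race"].fillna("Unknown")
    df["flee"] = df["flee"].fillna(df["flee"].mode()[0])

    # Clip extreme ages with the IQR rule as a simple outlier treatment
    q1, q3 = df["age"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["age"] = df["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)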

October 16, 2023

Topics Learnt Today:

1: Clustering algorithms are employed to group similar incidents, such as fatal police shootings, based on their spatial proximity.

  • Clustering refers to the process of identifying groups of data points that are located close to each other in space.
  • The purpose of clustering in this scenario is to detect patterns or trends in the spatial distribution of these incidents.
  • Clustering can reveal that specific geographic areas have a higher incidence of fatal police shootings compared to others.
  • Clustering can provide valuable insights into the spatial relationships between incidents, potentially aiding law enforcement agencies, policymakers, and researchers in understanding the underlying factors contributing to the occurrences.

2: The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is mentioned as a specific method for clustering the data related to fatal police shootings.

  • DBSCAN is a popular clustering algorithm that can identify clusters of data points in a dataset while also recognizing and labeling noise points. Applied to the data with carefully chosen parameters, it groups data points into clusters based on their spatial proximity.
  • Data points that do not belong to any cluster are considered noise points. This process assists in discerning geographic regions with a higher density of incidents and isolating areas with fewer occurrences.
  • DBSCAN is known for its ability to handle clusters of various shapes and sizes, making it suitable for spatial analysis of incidents like fatal police shootings. The algorithm is valuable for identifying patterns and trends in the data and can provide a deeper understanding of the spatial distribution of these incidents.
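
A sketch of density-based clustering on coordinates; the points below are invented, and the haversine metric (which expects radians) is one way to respect geographic distance:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Invented (latitude, longitude) pairs, converted to radians
    coords = np.radians([[42.36, -71.06], [42.37, -71.05], [42.35, -71.07],
                         [34.05, -118.24], [34.06, -118.25], [40.71, -74.01]])

    earth_radius_km = 6371.0
    eps_km = 50.0  # treat points within ~50 km as neighbors (illustrative)

    db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=2,
                metric="haversine", algorithm="ball_tree").fit(coords)
    print(db.labels_)  # -1 marks noise points outside any dense region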

October 13, 2023

Topics Learnt Today:

1: Creating a “Point” Object with the Geopy Library:

  • I defined the latitude and longitude values for a specific location and created a Point object by passing these values to the Point() constructor. I then accessed the latitude and longitude of the Point object through its latitude and longitude attributes.
  • The purpose of this is often to work with geospatial data in various data analysis or mapping tasks.
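
For example (the coordinates below are arbitrary):

    from geopy.point import Point

    latitude, longitude = 42.3601, -71.0589  # arbitrary coordinates
    location = Point(latitude, longitude)

    print(location.latitude)   # 42.3601
    print(location.longitude)  # -71.0589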

2: Geographical heat maps serve as a valuable tool for visualizing data related to fatal police shootings.

  • Both the histogram and its smoothed version show where these incidents occur with greater frequency.
  • This visualization technique allows individuals to identify areas with a higher concentration of fatal police shootings, providing a clear and intuitive picture of the regions where the problem is most significant.
  • Geographical heat maps often use color gradients to represent the density of incidents, with darker colors indicating higher concentrations.
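
One way such a map could be built is with folium’s HeatMap plugin; the coordinates here are placeholders:

    import folium
    from folium.plugins import HeatMap

    incidents = [[42.36, -71.06], [42.37, -71.05], [34.05, -118.24]]  # placeholders

    m = folium.Map(location=[39.8, -98.6], zoom_start=4)  # centered on the US
    HeatMap(incidents, radius=15).add_to(m)               # hotter color = denser
    m.save("shootings_heatmap.html")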

October 11, 2023

Topics Learnt Today:
1: Formula for calculating the distance between two latitude and longitude points: distance = geodesic(point1, point2).kilometers

  • This formula is valuable for establishing the foundation of clustering procedures, particularly when dealing with geographic data.
  • It leverages the geodesic function to compute the distance between two geographical coordinates in kilometers.
  • Additionally, it is worth noting that by replacing .kilometers with .miles, the distance can be calculated in miles, providing flexibility in unit selection for distance measurement.
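
For example, with two arbitrary coordinate pairs:

    from geopy.distance import geodesic

    boston = (42.3601, -71.0589)   # arbitrary (latitude, longitude) pairs
    new_york = (40.7128, -74.0060)

    print(geodesic(boston, new_york).kilometers)  # ~306 km
    print(geodesic(boston, new_york).miles)       # ~190 mi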

2: An alternative statistical approach to comparing factors such as age and race is the ANOVA test.

  • Instead of conducting 15 separate t-tests (for example, pairwise comparisons of mean age across six race categories give C(6,2) = 15 tests), which would be inefficient and cumbersome when examining multiple variables, we can use an Analysis of Variance (ANOVA) test.
  • ANOVA offers a more comprehensive and efficient way to assess the relationships between multiple factors simultaneously.
  • This approach can provide a holistic view of the impact of these factors on the data, avoiding the need for multiple separate tests and facilitating a better understanding of their combined effects.
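
A minimal one-way ANOVA sketch with scipy; the group samples are randomly generated placeholders:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    group_a = rng.normal(40, 12, 200)  # placeholder age samples
    group_b = rng.normal(33, 11, 150)  # for three groups
    group_c = rng.normal(37, 13, 120)

    # One test across all groups instead of one t-test per pair
    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f_stat, p_value)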

October 6, 2023

Relationship between %Inactive, %Obese, %Diabetic using Pearson correlation:

Pearson Correlation Coefficient between % INACTIVE and % OBESE: 0.47265609987121526

  • The positive sign of the correlation coefficient (0.4727) indicates a positive linear relationship between the two variables, % INACTIVE and % OBESE. This means that as one variable increases, the other tends to increase as well, and vice versa.
  • The value of 0.4727 suggests a moderate positive correlation. It is not extremely close to 1, which would indicate a perfect positive linear relationship, nor is it close to 0, which would suggest no linear relationship. Instead, it falls in the range between 0 and 1, indicating a moderate strength of association.
  • This positive correlation suggests that regions or data points with higher percentages of physical inactivity (% INACTIVE) tend to also have higher percentages of obesity (% OBESE). Conversely, regions with lower physical inactivity percentages tend to have lower obesity percentages.
  • In short, a Pearson correlation coefficient of approximately 0.4727 between % INACTIVE and % OBESE indicates a moderate positive linear relationship between physical inactivity and obesity.
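
The coefficient itself can be computed with scipy; the file and column names below are assumptions about how the data is stored locally:

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("cdc_health_data.csv")  # assumed file name
    r, p = stats.pearsonr(df["% INACTIVE"], df["% OBESE"])  # assumed columns
    print(r, p)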

October 4, 2023

Mean comparison of inactivity and obesity:

Points observed:
1: Obesity rates are higher than physical inactivity rates.
2: The obesity rate has a more substantial impact on the dataset than the physical inactivity rate.

Histogram of %inactive and %obese

Points Observed:
1: Obesity and inactivity have different trends but show some correlation between them.
2: Inactivity has a wider spread of data, which means that it has higher variability.
3: Obesity has a comparatively narrower spread of data, which means it has less variability.
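
A sketch of how the overlaid histograms could be drawn with matplotlib, reusing the assumed file and column names from the entry above:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("cdc_health_data.csv")  # assumed file name
    plt.hist(df["% INACTIVE"], bins=30, alpha=0.5, label="% INACTIVE")
    plt.hist(df["% OBESE"], bins=30, alpha=0.5, label="% OBESE")
    plt.xlabel("Percentage")
    plt.ylabel("Count")
    plt.legend()
    plt.show()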

October 2, 2023

Inactivity dataset mathematical statistics:

From the inactivity dataset statistics we can observe that:
1: The median value is 16.7, indicating that approximately half of the data points fall below this value, and half fall above it.
2: The mean is slightly lower than the median (16.7), suggesting a slight negative skewness.
3: A standard deviation of 1.9253 indicates that the data values have moderate variability around the mean.
4: A negative skewness value (-0.3420) indicates that the data distribution is slightly left-skewed, with a tail extending to the left. This suggests that there may be a slight concentration of data points on the right side of the distribution, toward higher values.
5: A negative kurtosis value (-0.5490) suggests that the data distribution has lighter tails than a normal distribution (platykurtic). This indicates a lower likelihood of extreme values compared to a normal distribution.

The inactivity dataset shows a slight negative skew (left-skewed), with the mean slightly lower than the median. The standard deviation indicates moderate variability in the data, and the negative kurtosis suggests that the distribution has lighter tails than a normal distribution, indicating a lower likelihood of extreme values. The overall shape of the distribution is relatively close to normal.
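
These figures can be reproduced with pandas and scipy, and the same sketch applies to the obesity and diabetes statistics in the entries below; the file and column names are assumptions:

    import pandas as pd
    from scipy import stats

    inactivity = pd.read_csv("cdc_health_data.csv")["% INACTIVE"]  # assumed

    print(inactivity.median(), inactivity.mean(), inactivity.std())
    print(stats.skew(inactivity))      # negative = left-skewed
    print(stats.kurtosis(inactivity))  # excess kurtosis; negative = lighter tails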

September 29, 2023

Obesity dataset mathematical statistics:

From the statistics we can observe that:
1: The median value is 18.3, indicating that approximately half of the data points fall below this value, and half fall above it.

2: The mean is very close to the median (18.3), indicating that the data distribution is approximately symmetric.

3: A standard deviation of 1.0369 indicates that the data values are relatively close to the mean, with low variability.

4: A negative skewness value (-2.6851) indicates that the data distribution is strongly negatively skewed (left-skewed), with a tail extending to the left. This suggests that most of the data points are concentrated on the right side of the distribution, toward higher values, with the tail reaching toward lower values.

5: A kurtosis value of 12.3225 suggests that the data distribution has very heavy tails compared to a normal distribution (very leptokurtic). This indicates a high likelihood of extreme values compared to a normal distribution.

The obesity dataset statistics describe a strong negative skewness (left-skewed) and very heavy tails. The mean is close to the median, suggesting approximate symmetry, but the strong negative skewness and high kurtosis indicate that the data distribution has a significant concentration of values on the right side with the potential for extreme values on the left side.

September 27, 2023

Diabetes dataset mathematical statistics:

From these statistics we can observe that:
1: The median value is 8.4, indicating that approximately half of the data points fall below this value, and half fall above it.
2: The mean is 8.7198, which is slightly higher than the median (8.4). This suggests that the data distribution may be positively skewed.
3: A standard deviation of 1.7946 indicates that the data values tend to be relatively close to the mean on average, but there is still some variability.
4: A positive skewness value (0.9744) indicates that the data distribution is positively skewed, with a tail extending to the right. This confirms the earlier observation that the mean is greater than the median.
5: A kurtosis value of 1.0317 suggests that the data distribution has slightly heavier tails than a normal distribution. This indicates a slightly higher likelihood of extreme values compared to a normal distribution.

The above statistics show that the dataset has a positive skewness (right-skewed) and slightly heavier tails than a normal distribution.