Crime Hotspot Analysis using CrimeStat
Grant Buhay
Introduction:
The concept of Geographical Information Systems and profiling is not new to the area of community policing. Mechanical methods of point pattern analysis and data separation have been benefiting society ever since Dr. Snow's landmark discovery of the tainted well (Waters, 1995). The analysis of spatial and temporal data has been critical to the success of many criminal investigations. Buzzwords such as "geographic, criminal and psychological profiling" have rejuvenated interest in analytical geography, and advances in computer technology have given geographers a medium in which to express the capabilities of their science (Waters, 1999; Rossmo, K., http://www.ecricanada.com/AboutECRI.htm). "Criminal Geographic Targeting is an example of the practical application of academic research," concludes Rossmo, "and is finally putting geographers on the map" (Rossmo, 1997).
The purpose of this paper is to explore CrimeStat, a spatial statistics program for the analysis of crime incident locations, developed by Ned Levine, Ph.D., and Associates under a grant from the National Institute of Justice. In particular, the "hotspot analysis" module, which includes K-means clustering, nearest neighbor hierarchical spatial clustering and Local Moran statistics, will be examined for functionality, ease of navigation and interpretability of results. The K-means analysis will be employed to determine whether the distribution of data points displays a clustered pattern or one of complete spatial randomness. One would expect the "hotspot" aspect of the module to highlight areas of high concentration that may not be apparent on maps that simply plot crime locations. It is on this area of "hotspot" identification and analysis that this paper will focus.
Literature Review:
A hot spot has been defined as a condition indicating some form of clustering in a spatial distribution. However, not all clusters are hot spots, because the environments that help generate crime, the places where people live, also tend to be clustered. Any definition of hot spots therefore has to be qualified. Sherman (1995) defined hot spots "as small places in which the occurrence of crime is so frequent that it is highly predictable, at least over a 1-year period." According to Sherman, crime is approximately six times more concentrated among places than it is among individuals. The locational aspect is therefore extremely important.
However, there seems to be confusion surrounding the hot spot issue, especially when it comes to defining the difference between spaces and places. Block and Block (1995) pointed out that a place could be a point, such as an apartment building, or an area, such as a census tract. However, buildings are generally considered places, and census tracts spaces. Concentrations of criminal activity may be easily identified on a relatively simple point map of crime locations, but this becomes problematic when multiple crimes that occur at a single address are displayed by a single point on a pin map (Sadler, 1998). There is therefore some academic debate over an explicit definition of a hot spot, except in programs such as CrimeStat whose procedures self-define hot spots. Hot spots are specific to their local conditions. In Baltimore County, Maryland, for example, hot spots are identified according to three criteria: frequency, geography, and time. At least two crimes of the same type must be present. The area is small and the timeframe is short, a 1- to 2-week period. Hot spots are generally monitored by crime analysts until they become inactive (Canter, 1997). Although the definition may be elusive, common sense and objective analysis should clearly define hot spots.
Data Acquisition and Manipulation:
The analysis will be performed on crime data originally retrieved from the Tetrad Computer Applications Inc. Crime Analysis internet site (http://www.tetrad.com/new/crime.html#Profile) and published in the Vancouver Sun on September 16, 1995. The data was in a graphical map format that displayed the locations and modus operandi for unsolved murders in Vancouver spanning the twenty-four years from 1970 to 1994 (figure 1).
Figure 1. Unsolved Murders 1970-1994
The defined study area is immediately south of Stanley Park in Vancouver, and the original data consists of 135 crime locations. Each data point displayed on the map has associated with it the particulars of the victim, the modus operandi and the actual street address of the crime (table 1). The task of georeferencing the street network map and the crime locations appears, at first glance, deceptively simple. That is, until you try to locate a georeferenced city map of Vancouver, or a portion of one, in a digital format without having preapproved financing from a major lending institution. Therefore, it was necessary to take advantage of a corporate demo subject area database and graphical user interface, currently referred to as VIP MapGuide, compliments of Canadian Pacific Railway (figure 2).
VICTIM    ADDRESS        DATE      MO         SEX  AGE
Patricia  1160 Haro St.  19811125  Strangled  F    33
Table 1. Victim and Crime Particulars
This intranet site (http://mgpc/ctnvip/home.html) is in development to be used to locate and link customer shipments to the rail network and its attributes. This application allowed me to locate the lat/long coordinates associated with each data point and record them in an ASCII text file to be imported into CrimeStat and Idrisi32 for visual analysis. This was a laborious and time-consuming task, repeated for each data point. Where very tight clusters of data points were encountered, it was difficult to distinguish individual points. As a result, the data set was reduced to 103 locations, marginally affecting high concentration areas.
Figure 2. Common Track Network VIP MapGuide
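The coordinate file described above, and the earlier concern about several crimes sharing one address, can be illustrated with a minimal sketch. It assumes a whitespace- or comma-delimited text layout with longitude first and latitude second; the actual layout of the file used in this paper is not documented, and the function names `load_points` and `co_located` are hypothetical.

```python
from collections import Counter


def load_points(text):
    """Parse 'lon lat' pairs from delimited text, skipping non-numeric lines.

    The column order (longitude, then latitude) is an assumption.
    """
    points = []
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            continue
        try:
            lon, lat = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # header or other non-numeric line
        points.append((lon, lat))
    return points


def co_located(points, decimals=4):
    """Return locations that occur more than once, with their counts.

    Rounding to `decimals` places treats near-identical coordinates as
    one location; the precision is an illustrative choice.
    """
    counts = Counter((round(x, decimals), round(y, decimals)) for x, y in points)
    return {location: count for location, count in counts.items() if count > 1}
```

Listing the repeated locations before plotting shows where a single pin on the map stands for more than one incident.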
Methods:
The nearest neighbor hierarchical spatial clustering routine groups points together on the basis of spatial proximity. The user defines a significance level associated with a threshold distance, a minimum number of points required for each cluster, and an output size for displaying the clusters as ellipses. Clustering is hierarchical in that the first-order clusters are treated as separate points to be clustered into second-order clusters, the second-order clusters are treated as separate points to be clustered into third-order clusters, and so on. Higher-order clusters will be identified only if the distance between their centers is less than the new threshold distance. The results can be saved to a text file, output as a '.dbf' file, or output as ellipses to ArcView '.shp', MapInfo '.mif' or Atlas*GIS '.bna' files. The cluster output size can be adjusted to set the number of standard deviations covered by the ellipse, from one standard deviation, the default value, to five standard deviations. The number of clusters can be restricted by defining the minimum number of points required per cluster. The default is 10. If too few points are required, there will be many very small clusters. Increasing the number of required points reduces the number of clusters.
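As a rough illustration of the first-order step, the sketch below computes a threshold distance as the lower confidence limit of the mean nearest neighbor distance expected under randomness, then groups points linked by chains of pairs closer than that threshold. This is a simplification and not CrimeStat's own routine: the threshold formula uses the standard nearest neighbor expressions, the study area value must be supplied by the user, and the single-linkage grouping and all function names are assumptions made for illustration.

```python
import math


def nnh_threshold(n_points, area, t_value):
    """Lower confidence limit of the mean random nearest-neighbour distance."""
    expected = 0.5 * math.sqrt(area / n_points)
    std_err = 0.26136 / math.sqrt(n_points ** 2 / area)
    return expected - t_value * std_err


def first_order_clusters(points, threshold, min_points):
    """Group points connected by pairs closer than `threshold` (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    # Keep only groups that meet the minimum size requirement.
    return [group for group in groups.values() if len(group) >= min_points]
```

Higher-order clusters would follow by applying the same steps to the centers of the first-order clusters, with a threshold recomputed for that smaller set of points.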
The K-means clustering routine is a procedure for partitioning all the points into K groups, where K is a number assigned by the user. The default K is 5. The routine finds K seed locations such that the distances between points within a cluster are small (minimum within) but the distances between seed locations are large (maximum between). If K is small, the clusters will typically cover larger areas. Conversely, if K is large, the clusters will typically cover smaller areas. The results can also be saved to a text file, output as a '.dbf' file, or output as ellipses to ArcView '.shp', MapInfo '.mif' or Atlas*GIS '.bna' files.
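The partitioning idea can be sketched with the textbook K-means iteration (Lloyd's algorithm). CrimeStat does not document its own seeding or update rules, so the random seeding, the stopping rule and the function name `k_means` below are assumptions for illustration only.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    """Partition 2-D points into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed seeding: k random data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, labels
```

Every point receives a label, which is why a K-means run always accounts for the full sample regardless of whether the groups are meaningful.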
Method of Analysis Nnh:
The first output to be examined is that of the nearest neighbor hierarchical spatial clustering (Nnh). In doing so, it is necessary to restate our research objectives by adhering to the following six steps.
Step 1. Hypothesis Statement
The first step is to establish a null hypothesis statement regarding complete spatial randomness (CSR). In this case we can state: Ho: there is no statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a random pattern. Ha: there is a statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a clustered pattern.
Step 2. Choice of Test:
The statistical test in this case appears to be the t-test (T-value), although traditional nearest neighbor analysis (NNA) employs a z-statistic (standard normal deviate).
Step 3. Sample Size and Significance Level:
The sample size for all tests in this analysis is 103 data points, and the significance level is 0.05. The threshold distance is adjusted by the significance level. Distances smaller than the threshold are candidates for clustering. The larger the alpha level chosen, the larger the areas the clusters will cover, with larger ellipses. The smaller the alpha level, the smaller the areas the clusters will cover, with smaller ellipses. However, the higher the alpha level chosen, the greater the likelihood that clusters could be chance groupings.
Step 4. Sampling Distribution:
Since we are employing a t-test to determine statistical significance, we can assume we are dealing with a t distribution.
Step 5. Region of Rejection:
Based on our sample size of 103, an alpha value of 0.05, n - 1 degrees of freedom and a two-tailed test, we know from the t-tables that the region of rejection lies beyond approximately +/- 1.96.
Step 6. Decision Rule:
Based on our hypothesis statement and the region of rejection we must then reject or fail to reject the null hypothesis. If our calculated T-value is greater than our critical T-value we must reject the null hypothesis Ho and accept Ha. Since our calculated T-value of 1.671 is less than our critical T-value of 1.96, we cannot reject Ho and therefore state: there is no statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a random pattern.
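For comparison with the six steps above, the traditional nearest neighbor analysis mentioned in Step 2 can be sketched as follows. The code uses the standard nearest neighbor index formulas and the two-tailed critical value of 1.96 quoted in Step 5; the study area must be supplied by the user, and the function names are hypothetical. It is not a reconstruction of how CrimeStat arrives at its T-Level.

```python
import math


def nearest_neighbour_z(points, area):
    """Return (index, z) for the nearest neighbour test against CSR.

    index < 1 suggests clustering, index > 1 suggests dispersion.
    """
    n = len(points)
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    observed = sum(nearest) / n                 # mean observed distance
    expected = 0.5 * math.sqrt(area / n)        # mean distance under CSR
    std_err = 0.26136 / math.sqrt(n * n / area)
    return observed / expected, (observed - expected) / std_err


def decide(statistic, critical=1.96):
    """Two-tailed decision rule from Step 6."""
    return "reject Ho" if abs(statistic) > critical else "cannot reject Ho"
```

Applied to the value reported in this paper, `decide(1.671)` returns "cannot reject Ho", which matches the decision reached in Step 6.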
Results Nnh:
The results from the nearest neighbor hierarchical clustering analysis run in the CrimeStat software are displayed below in table 2.
Nearest Neighbor Hierarchical Clustering
Sample size ..........: 103
Significance level ...: 0.05000 (5.000%)
T-Level ..............: 1.671
Measurement type .....: Direct
Clusters found .......: 1
Displaying 1 ellipse(s) starting from 1
Order  Cluster  Mean X     Mean Y    Rotation  X-Axis(mi.)  Y-Axis(mi.)  Area(sq mi.)  Points
-----  -------  ---------  --------  --------  -----------  -----------  ------------  ------
1      1        123.09884  49.28063  19.15483  0.30990      0.20193      0.19660       15
Table 2. Nnh Clustering results
In an effort to make the interpretation more intuitive, the original data point ASCII file was converted to a raster file in Idrisi32 (figure 3). By importing the ".shp" file created by CrimeStat and converting it to a vector file, we are able to display the calculated Nnh ellipse overlaid on the original rasterized data points (figure 4). This ellipse is bounded by a single standard deviation, which was defined by the creation parameters.
Figure 3. Raster Image of Unsolved Murders 1970-1994
Figure 4. Raster Image of Unsolved Murders 1970-1994 with Nnh Clustering Ellipse
Method of Analysis K-Means:
The final output to be examined is that of the K-means clustering. In doing so, it is necessary to restate our research objectives by once again adhering to the following six steps.
Step 1. Hypothesis Statement
The first step is to establish a null hypothesis statement regarding complete spatial randomness (CSR). In this case we can state: Ho: there is no statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a random pattern. Ha: there is a statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a clustered pattern.
Step 2. Choice of Test:
The statistical test in this case is unique to the K-means method of analysis. Although the CrimeStat program does not provide the mathematical foundations of this or any of the other applied methods, one would assume it is based on the calculation of the K-function statistic, lK.
Step 3. Sample Size and Significance Level:
The sample size for all tests in this analysis is 103 data points, and the significance level is 0.05.
Step 4. Sampling Distribution:
Once again, we lack the information necessary to make a definitive statement on the sampling distribution and assume it is similar in nature to that employed by Chen and Getis in the K-function analysis module of the PPA program they have developed. They assumed stationarity, which is necessary if inferences are to be made from a single observed pattern. The model underlying this testing procedure and its assumptions is referred to as a homogeneous Poisson process.
Step 5. Region of Rejection:
The rejection region lies outside the envelope defined by the minimum and maximum expected values. If the observed value falls outside this envelope then we must reject Ho. If the observed value falls within the envelope then we cannot reject Ho at the 95% level of confidence.
Step 6. Decision Rule:
Based on our hypothesis statement and the region of rejection we must then accept or reject the null hypothesis. In this case, the value itself, when compared to the region of rejection (either above or below) can provide further information regarding the distribution of our point data.
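The envelope logic in Steps 5 and 6 can be sketched as a Monte Carlo test: simulate random patterns, record the smallest and largest simulated values of the statistic, and compare the observed value against them. The sketch assumes a rectangular study area, a K-function estimate without edge correction at one chosen distance, and 39 simulations (which gives a two-sided level of 2/40 = 0.05). None of these choices is confirmed by the CrimeStat or PPA documentation, and the function names are hypothetical.

```python
import math
import random


def k_estimate(points, distance, area):
    """Naive K-function estimate at one distance (no edge correction)."""
    n = len(points)
    pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= distance
    )
    return area * pairs / (n * (n - 1))


def csr_envelope(n, distance, width, height, simulations=39, seed=0):
    """Min and max of the statistic over simulated random patterns."""
    rng = random.Random(seed)
    values = []
    for _ in range(simulations):
        pattern = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        values.append(k_estimate(pattern, distance, width * height))
    return min(values), max(values)


def classify(observed, low, high):
    """Decision rule: outside the envelope rejects Ho, and the side says how."""
    if observed > high:
        return "clustered"
    if observed < low:
        return "dispersed"
    return "cannot reject Ho"
```

An observed value above the envelope points to clustering and one below it to dispersion, which is the further information the decision rule refers to.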
K-Means Results
The results for the K-means clustering analysis are presented below in table 3.
K-Means Clustering:
Sample size ...: 103
Clusters ......: 5
Iterations ....: 2
Cluster  Mean X     Mean Y    Rotation  X-Axis(mi.)  Y-Axis(mi.)  Area(sq mi.)  Points
-------  ---------  --------  --------  -----------  -----------  ------------  ------
1        123.09023  49.27834  86.51926  0.83148      1.16434      3.04148       49
2        123.13010  49.28052  3.28745   1.45250      1.22886      5.60750       35
3        123.13898  49.20872  14.65336  0.04971      1.40915      0.22005       3
4        123.07925  49.22559  39.79857  1.82892      0.71546      4.11085       8
5        123.03397  49.23493  83.19329  1.31585      1.10806      4.58057       8
Table 3. K-Means Clustering Results
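The tabulated output can be checked for internal consistency. The sketch below assumes the X-Axis and Y-Axis columns are the semi-axis lengths of each ellipse, in which case the area should equal pi times their product; the values are copied from tables 2 and 3, and the helper names are hypothetical.

```python
import math

# (X-Axis, Y-Axis, Area, Points) copied from tables 2 and 3.
NNH_ROWS = [(0.30990, 0.20193, 0.19660, 15)]
KMEANS_ROWS = [
    (0.83148, 1.16434, 3.04148, 49),
    (1.45250, 1.22886, 5.60750, 35),
    (0.04971, 1.40915, 0.22005, 3),
    (1.82892, 0.71546, 4.11085, 8),
    (1.31585, 1.10806, 4.58057, 8),
]


def ellipse_area(semi_axis_a, semi_axis_b):
    """Area of an ellipse from its two semi-axis lengths."""
    return math.pi * semi_axis_a * semi_axis_b


def max_area_gap(rows):
    """Largest difference between the computed and the reported area."""
    return max(abs(ellipse_area(a, b) - area) for a, b, area, _ in rows)


print("largest area gap:", max_area_gap(NNH_ROWS + KMEANS_ROWS))
print("K-means points:", sum(points for *_, points in KMEANS_ROWS))
```

The reported areas agree with this formula to within rounding of the printed axis values, and the five K-means point counts sum to 103, the full sample, as expected of a routine that assigns every point to a group.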
Once again, the ellipse files were created with a bound of one standard deviation. These ".shp" files were imported into Idrisi32 and overlaid on the original data points (figure 5) for ease of interpretation.
Figure 5. Raster Image of Unsolved Murders 1970-1994 with K-Means Ellipses
Analysis of Results:
Although the raster images created by this procedure are visually appealing, the results are still ambiguous. The K-means clustering routine is a procedure for partitioning all the points into K groups, where K is a number assigned by the user. In this case the default number of 5 was accepted. Five ellipses were created, but only clusters 1 and 2 contain a substantial population of data points, 49 and 35 respectively. The threshold that determines whether an ellipse is significantly clustered or CSR is not readily apparent. It appears, however, that for clusters 3, 4 and 5, which contain only 3, 8 and 8 data points respectively, there is no statistically significant difference between the observed and expected values. Therefore we cannot reject Ho and state: the distribution of points constitutes a random pattern. The same cannot be said for clusters 1 and 2. Their point counts suggest that there is a statistically significant difference between the observed and expected values; therefore we must reject Ho and state: the distribution of points constitutes a clustered pattern for these clusters. These two areas, then, would be classified as "hot spots", areas of concentrated crime. This may be deceiving because, as discussed previously, multiple crimes at one location can lead to misleading results. Such a location could range from an apartment complex to a vacant industrial site. Analysis of this type must be "ground truthed", and inferences should not be made without a priori intimate knowledge of the study area.
CrimeStat Analysis:
CrimeStat, the spatial statistics program for the analysis of crime incident locations, has eight functional tabs, four of which are dedicated to data structures and parameters: primary file, secondary file, reference file and measurement parameters. The initial data input screen can be troublesome, as the "compute" button did little more than freeze the program. Three of the remaining four tabs, Distance Analysis, Hot Spot Analysis and Interpolation, provide a powerful analytical arsenal. The Interpolation module, which calculates probabilities and densities and can handle several layers of different distributions, would be very useful in time-series and change detection analyses. Unfortunately the Spatial Autocorrelation Indices module, which houses Moran's I and Geary's C, was completely inaccessible, with dialogue boxes "greyed out" entirely. In general, the CrimeStat program is user friendly, but it requires an in-depth knowledge of the subtleties of several similar operations, and the lack of mathematical documentation makes this task much more difficult.
Conclusions:
The complicated and important process of data acquisition, rectification and preparation for implementation in a GIS software system is often underestimated. A common source of digital data, of the kind provided by projects like TIGER, is long overdue and could provide "customers" and users of this data warehouse with a consistent and reliable single-source database. This is not to say a single-source database may not be erroneous, but the error is kept constant if all are accessing the same information.
The purpose of this paper was to explore CrimeStat, the spatial statistics program for the analysis of crime incident locations. The K-means clustering and nearest neighbor hierarchical spatial clustering methods of the "hotspot analysis" module were both successful in processing the input data effectively and efficiently. The nearest neighbor hierarchical spatial clustering routine found the overall distribution of data points to be completely spatially random. The K-means analysis was employed to refine the Nnh process by categorizing the data points into five ellipses, based on the standard deviation of the data along both the major and minor axes. Two of these ellipses, ellipses 1 and 2, were found to be statistically significant. The "hotspot" aspect of the program did indeed highlight areas of higher concentration and is a useful identification and monitoring tool. The CrimeStat program, and programs like it, provide users with the ability to create intelligent maps that allow us to extrapolate information into predictive models. This "modular approach" of expanding and developing specific software supplements empowers the user by adding new applications and dynamic methodologies to the static and expensive cartographic/GIS software nuclei (Anselin, L., 1998). These analyses cannot stand alone; they require an intelligent user making intelligent choices to provide meaningful results.