Quantifying Spatial Autocorrelation with Moran’s I

Python
Geospatial
Published October 20, 2024

Spatial autocorrelation is a statistical concept that measures how similar the values of a variable are based on their proximity to one another. Positive spatial autocorrelation means that geographically nearby values tend to be similar: high values tend to be located near high values, and low values near low values. By analyzing the spatial distribution of potential hazards, insurers can identify patterns and trends that might not be apparent when looking at data in isolation. This allows for more accurate risk assessment and pricing, ensuring that premiums are set appropriately based on the actual risk in different geospatial cohorts. Additionally, understanding geospatial correlations can help insurers develop targeted strategies for risk mitigation and improve their overall portfolio management. This ultimately leads to better financial stability for the insurer and fairer pricing for policyholders.

It is well known that tornadoes frequently occur in “Tornado Alley”, a loosely defined region of states including Texas, Louisiana, Oklahoma, Kansas, South Dakota, Iowa and Nebraska. In what follows, we analyze tornado events from 2008 to the present, estimate the global spatial autocorrelation for these events using Moran’s I, and determine whether the results align with our prior assumption of positive spatial autocorrelation.

The National Oceanic and Atmospheric Administration (NOAA) documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. The database for events going back to 1950 is available for download here, and serves as the starting point for our analysis. U.S. state shapefiles are available here.


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd

np.set_printoptions(suppress=True, precision=5)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Storm data. 
noaa_path = "noaa-storm-data-1950-2024.csv.gz"

# US state shapefiles. 
state_shp = "tl_2024_us_state.zip"


We begin by reading in the NOAA storm event data and filtering down to tornado events from 2008 to the present. We also remove lower-severity tornado events based on the Enhanced Fujita scale: events with TOR_F_SCALE of “EFU”, “EF0” or “EF1” are excluded.


# Obtained here: https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/
dfstorms_all = pd.read_csv(noaa_path, low_memory=False)

keep_events = [
    "F2", "EF2", "F3", "F4", "F5", "EF3", "EF4", "EF5"
]

dfstorms = dfstorms_all[
    (dfstorms_all.YEAR>=2008) & 
    (dfstorms_all.EVENT_TYPE=="Tornado") &
    (dfstorms_all.TOR_F_SCALE.isin(keep_events))
]

dfstorms = (
    dfstorms
    .dropna(subset=["BEGIN_LAT", "BEGIN_LON", "TOR_F_SCALE"])
    .reset_index(drop=True)
)

# Filter storms to lower 48 states.
storms = gpd.GeoDataFrame(
    dfstorms,
    geometry=gpd.points_from_xy(dfstorms.BEGIN_LON, dfstorms.BEGIN_LAT),
    crs="EPSG:4326"
)

storms = storms[["EVENT_ID", "TOR_F_SCALE", "geometry"]]
storms = storms.cx[-125:-65, 24.6:50].reset_index(drop=True)

print(f"Number of storms: {storms.shape}")

storms.head()
Number of storms: (2917, 3)
EVENT_ID TOR_F_SCALE geometry
0 105611 EF2 POINT (-96.30000 43.14210)
1 80215 EF2 POINT (-86.01250 37.66250)
2 80216 EF2 POINT (-85.77590 37.69650)
3 80218 EF2 POINT (-85.06490 37.68890)
4 79065 EF2 POINT (-85.38120 37.81130)


Reviewing the distribution of EF events:


storms.TOR_F_SCALE.value_counts()
TOR_F_SCALE
EF2    2197
EF3     579
EF4     128
EF5      13
Name: count, dtype: int64


Load the state shapefile and overlay retained tornado events:


import os
import sys

# This suppresses `Cannot find header.dxf (GDAL_DATA is not defined)` warning.
os.environ['GDAL_DATA'] = os.path.join(f'{os.sep}'.join(sys.executable.split(os.sep)[:-1]), 'Library', 'share', 'gdal')

# Filter to lower 48 states. 
states = gpd.read_file(state_shp).to_crs("EPSG:4326")[["NAME", "ALAND", "geometry"]]
states = states.cx[-125:-65, 24.6:50].reset_index(drop=True)

fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("U.S. Tornadoes ex. EF0, EF1: 2008-Present", fontsize=10)
states.boundary.plot(ax=ax, edgecolor="grey", linewidth=.35)
storms.plot(ax=ax, color="blue", marker="x", markersize=25, alpha=.35)
ax.axis("off")
plt.show()


There are more tornadoes in the Mid-South and Southeast than the traditional definition of Tornado Alley would suggest. Nonetheless, there is a clear spatial pattern to the events: tornadoes do not appear randomly distributed over the lower 48 states. We can use Global Moran’s I to gauge the degree of spatial autocorrelation inherent in the data.

To assign each tornado to a state, a spatial join is performed, matching each event to the state polygon that encloses it:


dfn = (
    states
    .sjoin(storms)
    .groupby(["NAME", "ALAND"], as_index=False).size()
    .rename({"size": "n"}, axis=1)
)

# Merge dfn with states to get back geometry associated with each state.
# gdf will be a GeoDataFrame.
gdf = (
    states
    .merge(dfn, on=["NAME", "ALAND"], how="left")
    .fillna(value={"n": 0})[["NAME", "ALAND", "n", "geometry"]]
    .sort_values("NAME")
    .reset_index(drop=True)
)

print(f"Total number of events: {gdf['n'].sum()}")

gdf.head(20)
Total number of events: 2917.0
NAME ALAND n geometry
0 Alabama 131185561946 253.0 POLYGON ((-85.12733 31.76256, -85.12753 31.762...
1 Arizona 294366118294 7.0 POLYGON ((-110.75069 37.00301, -110.74193 37.0...
2 Arkansas 134658517854 133.0 POLYGON ((-90.95577 34.11871, -90.95451 34.117...
3 California 403673433805 3.0 MULTIPOLYGON (((-119.99987 41.18397, -119.9998...
4 Colorado 268418973518 31.0 POLYGON ((-105.15504 36.99526, -105.15543 36.9...
5 Connecticut 12541999507 0.0 POLYGON ((-72.52790 41.17777, -72.55156 41.173...
6 Delaware 5046692239 4.0 POLYGON ((-75.50949 39.68652, -75.50942 39.686...
7 District of Columbia 158316181 0.0 POLYGON ((-77.11975 38.93435, -77.11886 38.935...
8 Florida 138965379385 40.0 MULTIPOLYGON (((-83.10874 24.62949, -83.10711 ...
9 Georgia 149485762701 153.0 POLYGON ((-81.09538 31.52098, -81.09710 31.519...
10 Idaho 214050504522 0.0 POLYGON ((-116.35844 49.00047, -116.34805 49.0...
11 Illinois 143778206717 140.0 POLYGON ((-87.89243 38.28285, -87.89334 38.282...
12 Indiana 92786694938 96.0 POLYGON ((-86.32960 38.18180, -86.33037 38.182...
13 Iowa 144660344616 123.0 POLYGON ((-95.86095 43.50004, -95.85979 43.500...
14 Kansas 211753821560 129.0 POLYGON ((-94.61764 37.76124, -94.61766 37.760...
15 Kentucky 102266755818 118.0 MULTIPOLYGON (((-89.57120 36.55214, -89.57114 ...
16 Louisiana 111920866908 122.0 POLYGON ((-92.72474 33.01439, -92.72365 33.014...
17 Maine 79888396620 0.0 POLYGON ((-67.80569 45.68005, -67.80363 45.677...
18 Maryland 25151223822 5.0 POLYGON ((-75.75600 39.24607, -75.75579 39.243...
19 Massachusetts 20204400635 3.0 POLYGON ((-71.25515 42.73655, -71.25510 42.736...


Visualizing the number of tornadoes by state:


cmap = "turbo"

fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("U.S. Tornadoes by State ex. EF0, EF1: 2000-Present", fontsize=9)
gdf.plot(
    ax=ax, column="n", edgecolor="gray", linewidth=.50, 
    cmap=cmap, alpha=.750, legend=True,
    legend_kwds={"label": "Nbr. Tornadoes", "orientation": "vertical", "shrink": .35}
    )
ax.axis("off")
plt.show()


Texas appears to have a large number of events, but this may be because Texas has the largest land area of any state in the lower 48. If tornadoes were randomly distributed, we wouldn’t expect to see clusters of similar values on the map. However, our attention is drawn to the south-central and southeastern states, where hues are brighter than in coastal regions.

Let’s take a look at the number of events divided by land area (ALAND):


cmap = "turbo"

gdf["n_div_ALAND"] = gdf["n"] / gdf["ALAND"]

fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("U.S. tornadoes by state / land area ex. F0, EF0, F1, EF1: 2000-Present", fontsize=9)
gdf.plot(
    ax=ax, column="n_div_ALAND", edgecolor="gray", linewidth=.50, 
    cmap=cmap, alpha=.750, legend=True,
    legend_kwds={"label": "Nbr. Tornadoes", "orientation": "vertical", "shrink": .35}
    )
ax.axis("off")
plt.show()



Mississippi and Alabama remain prominent, consistent with the raw-counts choropleth, but Texas doesn’t stand out as much when normalized by total land area. Nonetheless, raw tornado counts are used for the subsequent analysis.


Spatial Weights

In order to quantify spatial autocorrelation, it is first necessary to define the idea of closeness or proximity between geospatial units. The spatial weights matrix \(W\) encodes the spatial relationships between different locations in a dataset in much the same way as an adjacency matrix used in Graph Theory and Network Science. For a dataset with \(N\) geospatial units, \(W\) will have dimension \(N \times N\), with 0’s along the main diagonal. Each element \(w_{ij}\) represents the spatial relationship between locations \(i\) and \(j\). The relationship can be based on distance, contiguity or some other criterion. More information on spatial weights matrices is available here.
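As an aside, contiguity is not the only option: PySAL also provides distance-based weights builders. The snippet below is an illustrative sketch (not used in the remainder of the analysis) that constructs a k-nearest-neighbor weights matrix from state centroids, with k=4 chosen arbitrarily:


from pysal.lib import weights

# Illustrative sketch: k-nearest-neighbor weights built from state centroids
# (k=4 is an arbitrary choice). Centroids are computed in EPSG:4326, so
# nearest neighbors are determined from degree-based distances.
centroids = gdf.set_geometry(gdf.geometry.centroid)
wknn = weights.KNN.from_dataframe(centroids, k=4)

# With KNN weights, every state has exactly k neighbors.
print(wknn.cardinalities)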

We use Queen contiguity, which treats two areas as neighbors if they share an edge or a vertex. The resulting binary weights matrix is symmetric: the value at cell \(w_{ij}\) is 1 if \(i\) and \(j\) are neighbors and 0 otherwise (the matrix is row-standardized later when computing the spatial lag). Using PySAL, the weights matrix with Queen contiguity can be created as follows:


from pysal.lib import weights

# Queen contiguity.
wq = weights.contiguity.Queen.from_dataframe(gdf, use_index=True)

# Convert to dense matrix for viewing:
W = pd.DataFrame(*wq.full()).astype(int)

W
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
8 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1
11 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
14 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0
16 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0
19 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
21 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
22 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
23 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
25 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
26 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
28 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
29 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0
30 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0
31 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0
32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
33 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
34 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
35 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
36 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
37 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
38 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
39 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
40 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
41 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
42 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
44 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
45 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
47 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
48 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0


The number of neighbors for each state is available in wq’s cardinalities attribute:


wq.cardinalities
{0: 4,
 1: 5,
 2: 6,
 3: 3,
 4: 7,
 5: 3,
 6: 3,
 7: 2,
 8: 2,
 9: 5,
 10: 6,
 11: 6,
 12: 4,
 13: 6,
 14: 4,
 15: 7,
 16: 3,
 17: 1,
 18: 5,
 19: 5,
 20: 5,
 21: 5,
 22: 4,
 23: 8,
 24: 4,
 25: 6,
 26: 5,
 27: 3,
 28: 3,
 29: 5,
 30: 6,
 31: 4,
 32: 3,
 33: 5,
 34: 6,
 35: 4,
 36: 6,
 37: 3,
 38: 2,
 39: 6,
 40: 8,
 41: 4,
 42: 6,
 43: 3,
 44: 6,
 45: 2,
 46: 5,
 47: 4,
 48: 6}


gdf is sorted by NAME, so index 0 represents Alabama, index 1 Arizona, and so on. Iowa is at index 13. The columns with value 1 in row index 13 represent Iowa’s neighbors in terms of Queen contiguity:


iowa = W.loc[13]

iowa_neighbor_indices = iowa[iowa==1].index

iowa_neighbor_names = gdf.loc[iowa_neighbor_indices, "NAME"]

iowa_neighbor_names
11        Illinois
21       Minnesota
23        Missouri
25        Nebraska
39    South Dakota
47       Wisconsin
Name: NAME, dtype: object
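Equivalently, the same lookup can be done with wq’s neighbors attribute, which maps each observation’s index to the list of its neighbors’ indices (a small sketch; it should recover the same six states):


# wq.neighbors is a dict keyed by observation index; the value for index 13
# is the list of Iowa's neighbor indices under Queen contiguity.
print(gdf.loc[wq.neighbors[13], "NAME"].tolist())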


With row-standardized weights, the spatial lag computes the average number of tornadoes across a given state’s neighbors and assigns the result to that state:


# Row standardized weights.
wq.transform = "r"

lag = weights.lag_spatial(wq, gdf["n"])

gdf["lag"] = lag


For Iowa, the spatial lag is 79.83. If we compute the average number of tornadoes from Illinois, Minnesota, Missouri, Nebraska, South Dakota and Wisconsin, we should get 79.83:


actual = gdf.loc[iowa_neighbor_indices, "n"].mean()

print(f"Iowa spatial lag - actual / expected: {actual:.2f} / 79.83")
Iowa spatial lag - actual / expected: 79.83 / 79.83


Visualizing the spatial lag can provide further insight into the underlying process. It tends to smooth spatial data, making it easier to identify patterns and trends:


cmap = "turbo"

fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("Spatial lag", fontsize=9)
gdf.plot(
    ax=ax, column="lag", edgecolor="gray", linewidth=.50, 
    cmap=cmap, alpha=.750, legend=True,
    legend_kwds={"label": "Nbr. Tornadoes", "orientation": "vertical", "shrink": .35}
    )
ax.axis("off")
plt.show()


Global Moran’s I

Now that we’ve defined the spatial weights matrix, it is straightforward to quantify spatial autocorrelation. Global Moran’s I is a measure of the overall clustering of spatial data, defined as

\[ I = \frac{N}{W} \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}(x_{i} - \bar{x})(x_{j} - \bar{x})}{\sum_{i=1}^{N} (x_{i} - \bar{x})^{2}}, \]

where

  • \(N\): The number of spatial units, 49 in our case (lower 48 states plus the District of Columbia)
  • \(W\): The sum of all elements of the spatial weights matrix
  • \(x\): The variable of interest (tornado counts)
  • \(\bar{x}\): The global mean of \(x\) over all spatial units
  • \(w_{ij}\): Elements of the spatial weights matrix, which encode the spatial interaction between locations \(i\) and \(j\)


We’ll again rely on Queen contiguity. We first compute Global Moran’s I using the formula above, then compare the result with the value returned by PySAL.


wq = weights.contiguity.Queen.from_dataframe(gdf, use_index=True)
W = pd.DataFrame(*wq.full()).values

# Computing Moran's I manually. 
N = gdf.shape[0]

# Average number tornadoes over all states.
xbar = gdf["n"].mean()

# Denominator of Moran's I.
denom = np.sum((gdf["n"] - xbar)**2)

# Update numerator at each iteration.
numer = 0

for i in range(N):
    x_i = gdf.at[i, "n"]
    for j in range(N):
        x_j = gdf.at[j, "n"]
        w_ij = W[i, j]
        numer += w_ij * (x_i - xbar) * (x_j - xbar)


# Compute Global Moran's I.
I1 = (N / W.sum()) * numer / denom

print(f"Manual calculation of Global Moran's I: {I1:.2f}.")
Manual calculation of Global Moran's I: 0.60.
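The double loop can also be expressed as a vectorized calculation, which serves as a useful sanity check. A brief sketch using the same binary weights matrix: with \(z\) the vector of deviations from the mean, the numerator is \(z^{\top} W z\) and the denominator is \(z^{\top} z\):


# Vectorized form of the same calculation using deviations from the mean.
z = gdf["n"].values - xbar
I1_vec = (N / W.sum()) * (z @ W @ z) / (z @ z)

print(f"Vectorized Global Moran's I: {I1_vec:.2f}.")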


Using PySAL (note that esda’s Moran row-standardizes the weights by default, which accounts for the slight difference from the manual calculation with binary weights):


from esda.moran import Moran

I2 = Moran(gdf["n"].values,  wq)

print(f"PySAL calculation of Global Moran's I: {I2.I:.2f}.")
PySAL calculation of Global Moran's I: 0.61.

As a rule of thumb, with the expected value of \(I\) under the null hypothesis being \(\frac{-1}{N - 1}\):

  • If \(I\) is above \(\frac{-1}{N - 1}\), there exists positive spatial autocorrelation (similar values are clustered)
  • If \(I\) is below \(\frac{-1}{N - 1}\), there exists negative spatial autocorrelation (neighboring values are dissimilar)
  • If \(I \approx \frac{-1}{N-1}\), there exists no spatial autocorrelation


The significance of Moran’s I is determined by a z-score and p-value computed relative to the expected value of \(-1 / (N - 1)\). The null hypothesis states that tornado counts are spatially uncorrelated. PySAL returns the p-value of Global Moran’s I under the normality assumption:


print(f"PySAL Global Moran's I p-value: {I2.p_norm}")
PySAL Global Moran's I p-value: 3.106619400835348e-11

This suggests it is very unlikely that the observed degree of clustering of tornadoes arose by chance.
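In addition to the analytical p-value, esda can also generate a permutation-based pseudo p-value by repeatedly reshuffling the tornado counts across states and recomputing the statistic. A brief sketch of this cross-check (999 permutations is the library default):


# Permutation-based inference: the observed I is compared against the
# distribution of I values computed on randomly permuted counts.
mi = Moran(gdf["n"].values, wq, permutations=999)

print(f"Pseudo p-value (999 permutations): {mi.p_sim}")
print(f"z-score relative to the permutation distribution: {mi.z_sim:.2f}")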

There are many other techniques that can be used to assess spatial autocorrelation. These will be explored in a future post.