# H3 Deidentification using Python
*Melissa Pearson, 10/24/2025*  

This notebook follows the same basic formulas as the Observable notebook, implemented with Python libraries.

## Import Python libraries
* pandas: data manipulation
* h3: h3 geospatial conversions
* folium: mapping

In [2]:
import pandas as pd
import h3
import folium

## Load data
Previous work extracted the Texas population data from the Kontur population dataset and downloaded a list of Texas schools, condensed to reduce the number of columns. Here both datasets are loaded from locally stored CSV files.

In [3]:
schools = pd.read_csv('tx_schools_condensed.csv', index_col=0)
pop = pd.read_csv('txpop_h3.csv', index_col=0)

## Rewrite the Observable function in Python
Observable anonymization function:
```
outputCellValues = {
  const outputCells = new Map();

  function getAnonymizingSum(cells) {
    return cells.map(cell => popMapSchools.get(cell) ?? 0).reduce((sum, val) => sum + val, 0);
  }
  
  for (const {lat, lng} of schoolLocations) {
    // Get the set of cells at the target res that have an anonymizing value over the threshold
    let currentRes = h3Resolution;
    let candidateSet = [h3.latLngToCell(lat, lng, currentRes)];
    while (getAnonymizingSum(candidateSet) < populationThreshold && currentRes > 3) {
      currentRes--;
      const parentCell = h3.latLngToCell(lat, lng, currentRes);
      candidateSet = h3.cellToChildren(parentCell, h3Resolution);
    }
    // Add all cells to the output map
    for (const cell of candidateSet) {
      const currentValue = outputCells.get(cell) ?? 0;
      outputCells.set(cell, currentValue + (1 / candidateSet.length))
    }
  }

  return outputCells;
}
```

Analysis of Steps:

1. The function `getAnonymizingSum` takes a list of H3 cells, looks up the Texas H3 population value for each one, and sums them, using 0 when a cell's value is null or undefined. (`??` is the [nullish coalescing operator](https://www.w3schools.com/jsref/jsref_oper_nullish.asp).)
2. **OUTER LOOP** Loop through all the points in the list of data points (here: school locations).
    1. Set the starting H3 resolution.
    2. Initialize the candidate set as a one-element list holding the cell that contains the point at that resolution.
3. **INNER LOOP #1**
    1. The `while` loop runs as long as the summed population of the candidate set (initially just the starting cell) is below the anonymizing threshold and the resolution is greater than 3. The threshold is defined outside the function and would likely be better as a function parameter.
    2. Decrease the current resolution by one, i.e. move to a coarser level (starting at 8, the first pass uses 7).
    3. Get the cell containing the point at the coarser resolution (e.g. the res-7 cell that acts as the parent of the res-8 cell).
    4. Get ALL children of that parent cell at the original resolution; these become the new candidate set.
4. **INNER LOOP #2**
    1. Once the candidate set's population sum reaches the threshold (or the resolution limit stops the loop), divide the metric value evenly over the full candidate set: each cell receives 1 / (number of candidate cells).
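
As a small illustration of step 1, the sketch below mirrors `getAnonymizingSum` in plain Python. The cell IDs and population values are made up, and the helper name `anonymizing_sum` is hypothetical. One difference worth noting: `dict.get(cell, 0)` only falls back to 0 when the key is missing, whereas `??` also covers a stored `null`, so this version treats `None` and NaN as 0 explicitly.

```python
import math

# Made-up population lookup: cell ID -> population (illustrative values only)
pop_by_cell = {
    "cell_a": 40,
    "cell_b": None,          # stored but empty, like a JS null
    "cell_c": float("nan"),  # e.g. a missing value read from a CSV
    "cell_d": 75,
}

def anonymizing_sum(cells, pop_lookup):
    """Sum populations for the given cells, counting missing/None/NaN as 0."""
    total = 0
    for cell in cells:
        value = pop_lookup.get(cell)  # None if the cell is absent
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        total += value
    return total

# "cell_x" is not in the lookup at all, so it also contributes 0
total = anonymizing_sum(["cell_a", "cell_b", "cell_c", "cell_d", "cell_x"], pop_by_cell)
print(total)
```

Guarding against NaN matters for the loop condition: a NaN sum compares as not less than any threshold, so the `while` loop would stop early without the population requirement actually being met.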
  

## Conversion to Python function

In [4]:
def get_output_cell_values(school_locations, pop_map_schools, h3_resolution, population_threshold):
    """
    Find, for each location, the set of H3 cells needed to reach the population threshold.

    Note: unlike the Observable version, this function returns the candidate set
    per location and does not distribute 1 / len(candidate_set) across the cells.

    Args:
        school_locations: DataFrame with 'lat' and 'lng' columns
        pop_map_schools: DataFrame with 'h3' (cell ID) and 'population' columns
        h3_resolution: Target H3 resolution
        population_threshold: Minimum population threshold for anonymization

    Returns:
        List of dicts, one per location, with keys 'lat', 'lng' and 'cell_set'
        (the H3 cells at h3_resolution that make up the anonymizing set)
    """
    # Convert pop_map_schools DataFrame to dict for faster lookups
    pop_map_dict = pop_map_schools.set_index('h3')['population'].to_dict()

    def get_anonymizing_sum(cells):
        # Cells missing from the population data count as 0
        return sum(pop_map_dict.get(cell, 0) for cell in cells)

    results = []

    for _, row in school_locations.iterrows():
        lat, lng = row['lat'], row['lng']

        # Start with the single cell containing the point at the target resolution
        current_res = h3_resolution
        candidate_set = [h3.latlng_to_cell(lat, lng, current_res)]

        # Coarsen until the candidate set meets the threshold (stopping at res 3)
        while get_anonymizing_sum(candidate_set) < population_threshold and current_res > 3:
            current_res -= 1
            parent_cell = h3.latlng_to_cell(lat, lng, current_res)
            candidate_set = list(h3.cell_to_children(parent_cell, h3_resolution))

        # Create dict for each location with lat, lng and deidentified set of cells
        row_dict = {
            'lat': lat,
            'lng': lng,
            'cell_set': candidate_set
        }
        results.append(row_dict)

    return results
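
The function above stops at the candidate sets and does not perform the last step of the Observable version, where each point contributes `1 / candidateSet.length` to every cell in its set. A minimal sketch of that step follows, assuming the list-of-dicts structure returned above. The helper name `distribute_weights` and the sample cell IDs are hypothetical.

```python
from collections import defaultdict

def distribute_weights(results):
    """Spread one unit per location evenly over its candidate cells."""
    output_cells = defaultdict(float)
    for item in results:
        cells = item["cell_set"]
        if not cells:
            continue  # nothing to distribute to
        share = 1 / len(cells)
        for cell in cells:
            output_cells[cell] += share
    return dict(output_cells)

# Illustrative input in the same shape as the function's return value
# (cell IDs are placeholders, not real H3 indexes)
sample_results = [
    {"lat": 30.0, "lng": -97.0, "cell_set": ["a"]},
    {"lat": 30.1, "lng": -97.1, "cell_set": ["a", "b"]},
    {"lat": 30.2, "lng": -97.2, "cell_set": ["b", "c", "d", "e"]},
]
weights = distribute_weights(sample_results)
print(weights)
```

The weights sum to the number of locations, so overall totals are preserved while each individual location is spread across its anonymizing set.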

## Example

Run the deidentification function on the schools dataset, after putting the input data into the expected format.

In [5]:
# Prepare school locations
school_locations = schools.copy()
school_locations.rename(columns={'Long':'lng','Lat':'lat'},inplace=True)

In [7]:
schools_deanon_100 = get_output_cell_values(school_locations=school_locations, pop_map_schools=pop, h3_resolution=8, population_threshold=100)

In [8]:
len(schools_deanon_100)

9555

In [9]:
# first item in the deidentified list:
schools_deanon_100[0]

{'lat': 30.98888000002848,
 'lng': -97.36670999985324,
 'cell_set': ['88489a1657fffff']}
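
To see how often the function had to coarsen beyond the starting cell, one option is to tally the size of each `cell_set`. The sketch below uses a small made-up list in place of `schools_deanon_100`; a set of length 1, as in the item above, means the original cell already met the threshold.

```python
from collections import Counter

# Stand-in for schools_deanon_100 (cell IDs are placeholders, not real H3 indexes)
sample = [
    {"lat": 30.0, "lng": -97.0, "cell_set": ["c1"]},
    {"lat": 30.1, "lng": -97.1, "cell_set": ["c1"]},
    {"lat": 30.2, "lng": -97.2, "cell_set": [f"c{i}" for i in range(7)]},
]

# Map candidate-set size -> number of locations with that size
size_counts = Counter(len(item["cell_set"]) for item in sample)
for size, count in sorted(size_counts.items()):
    print(f"{size} cell(s): {count} location(s)")
```

For hexagonal cells, each step to a coarser parent multiplies the number of children at the original resolution by 7, so typical sizes would be 1, 7, 49 and so on.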

In [10]:
# convert to dataframe
sdf = pd.DataFrame(schools_deanon_100)

In [11]:
# combine with schools location on lat / lng columns
school_locations = school_locations.merge(sdf, on=['lat','lng'])
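
One thing to watch with a merge on `lat`/`lng`: if several schools share the same coordinates, every matching pair is combined and rows multiply. Since the function returns one result per input row, in the same order, an alternative is to assign the column by position. The sketch below shows both on a made-up three-row frame. The data are illustrative, and whether this affects the schools dataset is not established here.

```python
import pandas as pd

# Made-up locations: the first two rows share coordinates
locations = pd.DataFrame({
    "School_Nam": ["A", "B", "C"],
    "lat": [30.0, 30.0, 31.0],
    "lng": [-97.0, -97.0, -98.0],
})

# One result per input row, in the same order (as the function returns)
results = [
    {"lat": 30.0, "lng": -97.0, "cell_set": ["x"]},
    {"lat": 30.0, "lng": -97.0, "cell_set": ["x"]},
    {"lat": 31.0, "lng": -98.0, "cell_set": ["y"]},
]

# Merging on coordinates pairs each duplicate with each duplicate: 2*2 + 1 rows
merged = locations.merge(pd.DataFrame(results), on=["lat", "lng"])

# Positional assignment keeps exactly one row per school
assigned = locations.copy()
assigned["cell_set"] = [r["cell_set"] for r in results]

print(len(merged), len(assigned))
```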

In [12]:
school_locations.head()

Unnamed: 0,lng,lat,LongLabel,School_Nam,County_Nam,City,Subregion,cell_set
0,-97.36671,30.98888,"Little River, Little River-Academy, TX, USA",ACADEMY JJAEP,BELL COUNTY,Little River-Academy,Bell County,[88489a1657fffff]
1,-95.436392,29.957201,"11902 Spears Rd, Houston, TX, 77067, USA",FORTIS ACADEMY,HARRIS COUNTY,Houston,Harris County,[88446c350dfffff]
2,-95.436392,29.957201,"11902 Spears Rd, Houston, TX, 77067, USA",FORTIS ACADEMY,HARRIS COUNTY,Houston,Harris County,[88446c350dfffff]
3,-97.4014,30.87914,"Holland, TX, USA",BELL COUNTY JJAEP,BELL COUNTY,Holland,Bell County,[88489a8e85fffff]
4,-97.222274,31.397236,"1191 Old Lorena Rd, Lorena, TX, 76655, USA",LORENA PRI,MCLENNAN COUNTY,Lorena,McLennan County,[88489a39cdfffff]


## Select subset of data to map
The map can be slow to load with all 9000+ entries, so we select a subset by county. The counties around Austin are listed below; the filter as written keeps only Travis County.

In [13]:
# Counties around Austin (the filter below currently uses Travis County only)
austin_counties = ['Travis County', 'Williamson County', 'Hays County', 'Bastrop County']
austin_schools = school_locations[school_locations['Subregion'] == 'Travis County']
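
If the intent is to keep all of the Austin-area counties rather than Travis County alone, `Series.isin` is one way to filter on a whole list. A minimal sketch on made-up rows (the `Subregion` column name follows the data above; the school names are placeholders):

```python
import pandas as pd

austin_area = ["Travis County", "Williamson County", "Hays County", "Bastrop County"]

demo = pd.DataFrame({
    "School_Nam": ["S1", "S2", "S3", "S4"],
    "Subregion": ["Travis County", "Harris County", "Hays County", "Bell County"],
})

# Keep rows whose Subregion matches any of the listed counties
subset = demo[demo["Subregion"].isin(austin_area)]
print(subset["School_Nam"].tolist())
```

The match is exact and case-sensitive, so the county names in the list need to be spelled the same way as in the `Subregion` column.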

In [14]:
austin_schools.head()

Unnamed: 0,lng,lat,LongLabel,School_Nam,County_Nam,City,Subregion,cell_set
103,-97.833076,30.143196,"12120 Manchaca Rd, Austin, TX, 78748, USA",MENCHACA EL,TRAVIS COUNTY,Austin,Travis County,[88489eacc1fffff]
116,-97.832417,30.457305,"12200 Anderson Mill Rd, Austin, TX, 78726, USA",HARMONY SCIENCE ACADEMY - CEDAR PARK,TRAVIS COUNTY,Austin,Travis County,[88489e2697fffff]
163,-97.561461,30.357418,"12900 Gregg Manor Rd, Manor, TX, 78653, USA",MANOR MIDDLE,TRAVIS COUNTY,Manor,Travis County,"[88489e21c1fffff, 88489e21c3fffff, 88489e21c5f..."
249,-97.698493,30.438444,"13801 Burnet Rd, Austin, TX, 78727, USA",PREMIER H S OF NORTH AUSTIN,ERATH COUNTY,Austin,Travis County,[88489e2553fffff]
373,-97.967627,30.308498,"14300 Hamilton Pool Rd, Austin, TX, 78738, USA",BEE CAVE EL,TRAVIS COUNTY,Austin,Travis County,"[884898ca51fffff, 884898ca53fffff, 884898ca55f..."


## Map with Folium

In [16]:
# Create a Folium map centered near Austin, Texas
m = folium.Map(location=[30, -98], zoom_start=10)

In [17]:
# Add H3 polygons
# Iterate through each row
for idx, row in austin_schools.iterrows():
    cells = row['cell_set']
    
    if not isinstance(cells, list) or len(cells) == 0:
        continue

    # Add each H3 cell
    for cell in cells:
        boundary = h3.cell_to_boundary(cell)
        polygon_coords = [[lat, lng] for lat, lng in boundary]
        
        folium.Polygon(
            locations=polygon_coords,
            color='red',
            weight=1,
            fill=False,
            fillColor='red',
            fillOpacity=0.1,
            # opacity=0.8,
            # tooltip=tooltip_text
        ).add_to(m)

In [19]:
# add actual school locations as dots on the map to assess the deidentification function
for _, row in austin_schools.iterrows():
    folium.Circle(
        location=[row['lat'], row['lng']],
        popup=row['School_Nam'],
        radius=10
    ).add_to(m)

In [20]:
# show map
m