Skip to contents

Census data in Brazil are published at the census tract level. However, many research and planning applications require data at different spatial granularities, such as hexagonal grids for standardized spatial analysis, or custom polygons such as traffic zones for transportation modeling.

tracts_to_h3() and tracts_to_polygon() perform dasymetric interpolation using CNEFE dwelling points as ancillary data. Instead of assuming uniform distribution within tracts, they allocate tract-level values to individual dwelling points identified in the CNEFE and then aggregate these to the target spatial units. The allocation rule depends on the variable type:

  • Sum variables (e.g., pop_ph, race_preta, age_70m): the tract total is divided equally among eligible dwelling points. The target spatial unit (e.g., hexagons, traffic zones, or any user-supplied polygon) receives the sum of all point-level shares that fall within it.
  • Average variables (avg_inc_resp): the tract-level average is assigned (not split) to each eligible dwelling point. The target spatial unit receives the average of all assigned values within it.

This approach preserves the heterogeneous distribution of dwellings within each tract, yielding more accurate estimates than simple areal interpolation.

This article demonstrates both functions: tracts_to_h3() for transferring census variables to an H3 hexagonal grid in Fortaleza, and tracts_to_polygon() for transferring census variables to traffic zones in São Paulo.

The currently supported variables (and their respective IBGE census tract codes) can be found using the tracts_variables_ref dataframe that accompanies the package:

library(cnefetools)

tracts_variables_ref |>
  knitr::kable()
var_cnefetools code_var_ibge desc_var_ibge table_ibge
pop_ph V00005 Domicilios Particulares Permanentes Ocupados, Quantidade de moradores Domicilios
pop_ch V00007 Domicilios Coletivos Com Morador, Quantidade de moradores Domicilios
male V01007 Sexo masculino Pessoas
female V01008 Sexo feminino Pessoas
age_0_4 V01031 0 a 4 anos Pessoas
age_5_9 V01032 5 a 9 anos Pessoas
age_10_14 V01033 10 a 14 anos Pessoas
age_15_19 V01034 15 a 19 anos Pessoas
age_20_24 V01035 20 a 24 anos Pessoas
age_25_29 V01036 25 a 29 anos Pessoas
age_30_39 V01037 30 a 39 anos Pessoas
age_40_49 V01038 40 a 49 anos Pessoas
age_50_59 V01039 50 a 59 anos Pessoas
age_60_69 V01040 60 a 69 anos Pessoas
age_70m V01041 70 anos ou mais Pessoas
race_branca V01317 Cor ou raca e branca Pessoas
race_preta V01318 Cor ou raca e preta Pessoas
race_parda V01320 Cor ou raca e parda Pessoas
race_amarela V01319 Cor ou raca e amarela Pessoas
race_indigena V01321 Cor ou raca e indigena Pessoas
n_resp V06001 Pessoas responsaveis em domicilios particulares permanentes ocupados ResponsavelRenda
avg_inc_resp V06004 Valor do rendimento nominal medio mensal das pessoas responsaveis com rendimentos por domicilios particulares permanentes ocupados ResponsavelRenda

Part 1: Census variables on H3 hexagons (Fortaleza)

We interpolate three census variables to H3 resolution 8 hexagons for Fortaleza (IBGE code 2304400):

  • pop_ph: population in private households
  • avg_inc_resp: average income of household heads
  • race_preta: population self-identified as Black (preta)
ftl_h3 <- tracts_to_h3(
  code_muni = 2304400,
  h3_resolution = 8,
  vars = c("pop_ph", "avg_inc_resp", "race_preta")
)
#> ℹ Processing code 2304400
#> ℹ Step 1/6: connecting to DuckDB and loading extensions...
#> ✔ Spatial extension loaded
#> ℹ Step 1/6: connecting to DuckDB and loading extensions...
✔ Step 1/6 (DuckDB ready) [788ms]
#> 
#> ℹ Step 2/6: preparing census tracts in DuckDB...
#> ℹ Using cached file: 'sc_23.parquet'
#> ℹ Step 2/6: preparing census tracts in DuckDB...
✔ Step 2/6 (Tracts ready) [399ms]
#> 
#> ℹ Step 3/6: preparing CNEFE points in DuckDB...
#> ℹ Using cached file: C:\Users\jorge\AppData\Local/R/cache/R/cnefetools/2304400_FORTALEZA.zip
#> ℹ Step 3/6: preparing CNEFE points in DuckDB...
✔ Step 3/6 (CNEFE points ready) [7.6s]
#> 
#> ℹ Step 4/6: spatial join (points to tracts) and allocation prep...
#> ✔ Step 4/6 (Join and allocation) [1.7s]
#> 
#> ℹ Step 5/6: aggregating allocated values to H3 cells...
#> ✔ Step 5/6 (Hex aggregation) [205ms]
#> 
#> ℹ Step 6/6: building H3 grid and joining results...
#> ✔ Step 6/6 (sf output) [806ms]
#> 
#> 
#> ── Dasymetric interpolation diagnostics ──
#> 
#> ── Stage 1: Tracts → CNEFE points 
#> ! Unallocated total for population from private households (pop_ph): 2169 of
#>   2424722 (0.09%)
#> ! Unallocated total for race_preta (race_preta): 74 of 170887 (0.04%)
#> ! avg_inc_resp assigned to 1032500 of 1033953 eligible points (99.86% of total
#>   points)
#> ! avg_inc_resp is NA in 128 of 4408 tracts (2.90% of total tracts)
#> ! Unmatched CNEFE points (no tract): 194 of 1034611 points (0.02% of total
#>   points)
#> ! Tracts with NA totals: pop_ph in 128 of 4408 tracts (2.90% of total tracts);
#>   race_preta in 200 of 4408 tracts (4.54% of total tracts)
#> ! Tracts with no eligible dwellings: pop_ph in 6 of 4408 tracts (0.14% of total
#>   tracts); race_preta in 6 of 4408 tracts (0.14% of total tracts)
#> 
#> ── Stage 2: CNEFE points → H3 hexagons 
#> ℹ CNEFE points mapped to H3 cells: 1034417 of 1034417 allocated points
#>   (100.00%)

head(ftl_h3)
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -38.61162 ymin: -3.827556 xmax: -38.47009 ymax: -3.71684
#> Geodetic CRS:  WGS 84
#>            id_hex     pop_ph avg_inc_resp race_preta
#> 1 8880104f11fffff   547.6366     9501.404   27.97055
#> 2 8880104c6bfffff  5130.2476    11917.039  151.67999
#> 3 8880104e17fffff  2969.0585     1637.215  257.11403
#> 4 8880104553fffff  8415.9331     1245.070  871.71422
#> 5 8880104cc1fffff  6760.5683     2766.186  469.64049
#> 6 8880107b6bfffff 15033.8448     1795.090  922.54085
#>                         geometry
#> 1 POLYGON ((-38.47411 -3.7882...
#> 2 POLYGON ((-38.48272 -3.7658...
#> 3 POLYGON ((-38.53372 -3.7908...
#> 4 POLYGON ((-38.60576 -3.8275...
#> 5 POLYGON ((-38.53578 -3.7271...
#> 6 POLYGON ((-38.58644 -3.7281...

Mapping the results

We use leafsync::sync() to display the three variables side by side:

map_pop <- mapview(
  ftl_h3,
  zcol = "pop_ph",
  layer.name = "Population"
)

map_inc <- mapview(
  ftl_h3,
  zcol = "avg_inc_resp",
  layer.name = "Avg. income"
)

map_race <- mapview(
  ftl_h3,
  zcol = "race_preta",
  layer.name = "Black pop."
)

sync(map_pop, map_inc, map_race, ncol = 3)

Exploring the relationship between income and racial composition

We can examine the relationship between average household head income and the proportion of Black population in each hexagon:

ftl_h3_plot <- ftl_h3 |>
  st_drop_geometry() |>
  filter(pop_ph > 0) |>
  mutate(pct_preta = race_preta / pop_ph)

ggplot(ftl_h3_plot, aes(x = avg_inc_resp, y = pct_preta)) +
  geom_point(alpha = 0.3, size = 3) +
  geom_smooth(method = "loess", se = TRUE, color = "steelblue") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    x = "Average income of household heads (R$)",
    y = "Black population share (race_preta / pop_ph)"
  ) +
  theme_minimal()

As expected, the plot reveals a strong negative correlation between these two variables: hexagons with lower average household income tend to have a substantially higher share of Black population. This pattern reflects deep-rooted racial income inequality, a persistent feature of the Brazilian social landscape.

Part 2: Census variables on traffic zones (São Paulo)

Transportation planning often requires census-derived variables at the traffic zone level rather than the census tract level. The odbr package provides traffic zone geometries from origin-destination surveys conducted in Brazilian metropolitan areas. Since the OD survey for São Paulo covers the entire metropolitan region (39 municipalities), we filter only the traffic zones within the municipality of São Paulo.

library(odbr)

# Load traffic zones for the São Paulo metropolitan region
sp_zones <- read_map(
  city = "Sao Paulo",
  year = 2017,
  geometry = "zone"
)

# Filter zones within the municipality of São Paulo
sp_zones_muni <- sp_zones |>
  filter(NomeMunici == "São Paulo") |> 
  sf::st_make_valid() # Correcting invalid geometries in advance

nrow(sp_zones_muni)
#> [1] 342

We interpolate two variables:

  • pop_ph: population in private households
  • age_70m: population aged 70 years or older
sp_zones_census <- tracts_to_polygon(
  code_muni = 3550308,
  polygon = sp_zones_muni, 
  vars = c("pop_ph", "age_70m")
)
#> ℹ Processing municipality code 3550308...
#> ℹ Step 1/6: aligning CRS...
#> ℹ Input CRS: "EPSG:22523" | Output CRS: "EPSG:22523"
#> ℹ Step 1/6: aligning CRS...
✔ Step 1/6 (CRS alignment) [273ms]
#> 
#> ℹ Step 2/6: connecting to DuckDB and loading extensions...
#> ✔ Spatial extension loaded
#> ℹ Step 2/6: connecting to DuckDB and loading extensions...
✔ Step 2/6 (DuckDB ready) [476ms]
#> 
#> ℹ Step 3/6: preparing census tracts in DuckDB...
#> ℹ Using cached file: 'sc_35.parquet'
#> ℹ Step 3/6: preparing census tracts in DuckDB...
✔ Step 3/6 (Tracts ready) [4.8s]
#> 
#> ℹ Step 4/6: preparing CNEFE points in DuckDB...
#> ℹ Using cached file: C:\Users\jorge\AppData\Local/R/cache/R/cnefetools/3550308_SAO_PAULO.zip
#> ℹ Step 4/6: preparing CNEFE points in DuckDB...
✔ Step 4/6 (CNEFE points ready) [15.4s]
#> 
#> ℹ Step 5/6: spatial join (points to tracts) and allocation...
#> ✔ Step 5/6 (Join and allocation) [46.5s]
#> 
#> ℹ Step 6/6: aggregating allocated values to polygons...
#> ✔ Step 6/6 (Polygon aggregation) [310ms]
#> 
#> 
#> ── Dasymetric interpolation diagnostics ──
#> 
#> ── Stage 1: Tracts → CNEFE points 
#> ! Unallocated total for population from private households (pop_ph): 86998 of
#>   11394071 (0.76%)
#> ! Unallocated total for age_70m (age_70m): 5397 of 923921 (0.58%)
#> ! Unmatched CNEFE points (no tract): 2935 of 4996529 points (0.06% of total
#>   points)
#> ! Tracts with NA totals: pop_ph in 622 of 27301 tracts (2.28% of total tracts);
#>   age_70m in 1316 of 27301 tracts (4.82% of total tracts).
#> ! Tracts with no eligible dwellings: pop_ph in 324 of 27301 tracts (1.19% of
#>   total tracts); age_70m in 297 of 27301 tracts (1.09% of total tracts)
#> 
#> ── Stage 2: CNEFE points → Polygons 
#> ℹ Polygon coverage: 4993321 of 4993594 allocated points captured (99.99%)
#> ℹ Polygons with no CNEFE points: 3 of 342 total polygons (0.88%)

head(sp_zones_census)
#> Simple feature collection with 6 features and 9 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 331820.5 ymin: 7393922 xmax: 334180.5 ymax: 7396311
#> Projected CRS: Corrego Alegre 1970-72 / UTM zone 23S
#> # A tibble: 6 × 10
#>   NumeroZona NomeZona      NumeroMuni NomeMunici NumDistrit NomeDistri Area_ha_2
#>        <dbl> <chr>              <dbl> <chr>           <dbl> <chr>          <dbl>
#> 1          1 Sé                    36 São Paulo          80 Sé              57.1
#> 2          2 Parque Dom P…         36 São Paulo          80 Sé             114. 
#> 3          3 Praça João M…         36 São Paulo          80 Sé              47.8
#> 4          4 Ladeira da M…         36 São Paulo          67 República       75.1
#> 5          5 República             36 São Paulo          67 República       75.0
#> 6          6 Santa Ifigên…         36 São Paulo          67 República       82.9
#> # ℹ 3 more variables: geom <POLYGON [m]>, pop_ph <dbl>, age_70m <dbl>

Mapping the elderly population share

We compute the share of population aged 70+ in each traffic zone. This indicator is relevant for identifying areas where transport planning should prioritize accessibility for people with reduced mobility, such as low-floor buses, accessible sidewalks, and demand-responsive transit services.

sp_zones_ratio <- sp_zones_census |>
  filter(pop_ph > 0) |>
  mutate(pct_70m = age_70m / pop_ph)

mapview(
  sp_zones_ratio,
  zcol = "pct_70m",
  layer.name = "Share of pop. aged 70+"
)

Notes

Notice how tracts_to_polygon() is particularly useful in transportation research, where census variables (such as population by age group and income) are essential inputs for travel demand models (especially in Trip Generation models from 4-Step modeling). However, traffic zone geometries rarely align with census tract boundaries. Dasymetric interpolation via CNEFE dwelling points provides a principled method to bridge this spatial mismatch, transferring census data to custom spatial units and supporting more accurate trip generation and distribution models.

Hint to scaling to multiple municipalities

The odbr traffic zones cover the entire São Paulo metropolitan region (39 municipalities). To interpolate census data for all municipalities, you can use purrr::map() to apply tracts_to_polygon() to each municipality and then combine the results with dplyr::bind_rows().

Future versions

In future versions of the package, tracts_to_polygon() will support municipality-independent operation, allowing users to pass polygons that span multiple municipalities without the need for per-municipality iteration.

Interpolation diagnostics

Both functions print detailed diagnostics after each run, so the user can assess the quality of the interpolation. For example, the tracts_to_h3() call for Fortaleza above produces:

── Dasymetric interpolation diagnostics ──
! Unallocated total for population from private households (pop_ph): "2169" ("0.09%" of total)
! Unallocated total for race_preta (race_preta): "74" ("0.04%" of total)
! avg_inc_resp assigned to 1032500 of 1033953 eligible points
! avg_inc_resp is "NA" in 128 tracts
! Unmatched CNEFE points (no tract): 194
! Tracts with "NA" totals: pop_ph "NA" in 128; race_preta "NA" in 200
! Tracts with no eligible dwellings: pop_ph in 6 tracts; race_preta in 6 tracts

And the tracts_to_polygon() call for São Paulo:

── Dasymetric interpolation diagnostics ──
! Unallocated total for population from private households (pop_ph): "86998" ("0.76%" of total)
! Unallocated total for age_70m (age_70m): "5397" ("0.58%" of total)
! Unmatched CNEFE points (no tract) = 2935
! Tracts with "NA" totals: pop_ph "NA" in 622; age_70m "NA" in 1316
! Tracts with no eligible dwellings: pop_ph in 324 tracts; age_70m in 297 tracts

These warnings report the amount of unallocated population (and its percentage of the census total), the number of CNEFE points that could not be matched to a tract, and the number of tracts with missing values or no eligible dwellings. In both examples, the unallocated shares are small (below 1%), indicating that the interpolation preserves most of the original census totals. Users should inspect these diagnostics to decide whether the loss is acceptable for their application.

Ecological fallacy caveat

Dasymetric interpolation distributes tract-level aggregates to dwelling points under the assumption that each dwelling within a tract receives an equal share of the total. This means that the method does not recover true individual- or dwelling-level values; it redistributes a group statistic. Drawing conclusions about individuals from such aggregated data constitutes an ecological fallacy (the error of attributing group-level patterns to individuals).

The risk is greater when the source census tracts are large, since larger tracts are more likely to contain internally heterogeneous populations, making the uniform allocation assumption less realistic. Users should interpret interpolated values as spatial estimates of aggregate quantities, not as precise measurements at the dwelling or individual level.