geoprepare

A Python package to prepare input data (download, extract, process) for GEOCIF and related models.

Installation

Note: The instructions below have only been tested on Linux.

Install Anaconda

We recommend using the conda package manager to install the geoprepare library and all its dependencies. If you do not have conda installed already, you can get it from the Anaconda distribution.

Using the CDS API

If you intend to download AgERA5 data, you will need to install the CDS API. You can do this by following the instructions here

geoprepare requires multiple Python GIS packages, including gdal and rasterio, which are not always easy to install. To make the process easier, you can optionally create a new environment using the following commands. Specify the Python version you have on your machine (Python >= 3.9 is recommended). The commands also install the pygis library, which pulls in multiple Python GIS packages.

conda create --name <name_of_environment> python=3.x
conda activate <name_of_environment>
conda install -c conda-forge mamba
mamba install -c conda-forge gdal
mamba install -c conda-forge rasterio
mamba install -c conda-forge xarray
mamba install -c conda-forge rioxarray
mamba install -c conda-forge pyresample
mamba install -c conda-forge cdsapi
mamba install -c conda-forge pygis
pip install wget
pip install pyl4c

Install the octvi package to download MODIS data

pip install git+https://github.com/ritviksahajpal/octvi.git

Downloading from the NASA distributed archives (DAACs) requires a personal app key. Configure the package with the octviconfig console script: after installation, run octviconfig in your terminal and enter your personal app key when prompted. Information on obtaining app keys can be found here

Using PyPI (default)

pip install --upgrade geoprepare

Using GitHub repository (for development)

pip install --upgrade --no-deps --force-reinstall git+https://github.com/ritviksahajpal/geoprepare.git

Local installation

Navigate to the directory containing pyproject.toml and run the following command:

pip install .

For development (editable install):

pip install -e ".[dev]"

Pipeline

geoprepare follows a three-stage pipeline:

  1. Download (geodownload) - Download and preprocess global EO datasets to dir_download and dir_intermed
  2. Extract (geoextract) - Extract EO variable statistics per admin region to dir_output
  3. Merge (geomerge) - Merge extracted EO files into per-country/crop CSV files for ML models and AgMet graphics

All datasets store files in year-specific subfolders (e.g., dir_intermed/cpc_tmax/2024/, dir_download/nsidc/2025/).

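Additional utilities:

  - geomove - Move existing files from flat directories into year-specific subfolders
  - geocheck - Check that all expected TIF files exist in dir_intermed and are non-empty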

Usage

config_dir = "/path/to/config"  # full path to your config directory

cfg_geoprepare = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geoextract.txt"]

1. Download data (geodownload)

Downloads and preprocesses global EO datasets. Only requires geobase.txt. The [DATASETS] section controls which datasets are downloaded. Each dataset is processed to global 0.05° TIF files in dir_intermed.

from geoprepare import geodownload
geodownload.run([f"{config_dir}/geobase.txt"])

2. Migrate to year subfolders (geomove)

Moves existing files from flat directories into year-specific subfolders. Run this once after upgrading to a version with year-subfolder support. All datasets are handled: CPC, ESI, NDVI, NSIDC, CHIRPS-GEFS, LST, Soil Moisture, AgERA5, VHI, FPAR, and AEF.

from geoprepare import geomove

# Preview what would be moved (no files are changed)
geomove.run([f"{config_dir}/geobase.txt"], dry_run=True)

# Execute the migration
geomove.run([f"{config_dir}/geobase.txt"])
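To illustrate what a flat-to-year migration with a dry-run mode can look like, here is a minimal standalone sketch. It is not geomove's implementation: the function name move_to_year_subfolders, the sample filenames, and the rule for finding the year (the first four-digit run starting with 19 or 20 in the filename) are all assumptions made for illustration.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Assumption: the year appears somewhere in the filename as 19xx or 20xx
YEAR_PATTERN = re.compile(r"(19|20)\d{2}")


def move_to_year_subfolders(folder, dry_run=False):
    """Move files sitting directly in `folder` into `folder/<year>/`.

    Returns a list of (source, destination) pairs.
    With dry_run=True the pairs are reported but nothing is moved.
    """
    folder = Path(folder)
    planned = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue  # skip subfolders that already exist
        match = YEAR_PATTERN.search(path.name)
        if match is None:
            continue  # no year in the name, leave the file in place
        destination = folder / match.group(0) / path.name
        planned.append((path, destination))
        if not dry_run:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(destination))
    return planned


# Demonstration on a temporary directory with made-up filenames
with tempfile.TemporaryDirectory() as tmp:
    flat_dir = Path(tmp) / "cpc_tmax"
    flat_dir.mkdir()
    for name in ("cpc_tmax_2024001.tif", "cpc_tmax_2025001.tif"):
        (flat_dir / name).write_bytes(b"data")

    # Preview only: files stay where they are
    preview = move_to_year_subfolders(flat_dir, dry_run=True)
    still_flat = (flat_dir / "cpc_tmax_2024001.tif").exists()

    # Real run: files end up in year subfolders
    move_to_year_subfolders(flat_dir)
    moved = (flat_dir / "2024" / "cpc_tmax_2024001.tif").exists()
```

Returning the planned moves in both modes keeps the preview and the real run on the same code path, so the dry run reports exactly what the migration would do.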

3. Validate downloads (geocheck)

Checks that all expected TIF files exist in dir_intermed and are non-empty. Writes a timestamped report to dir_logs/check/.

from geoprepare import geocheck
geocheck.run([f"{config_dir}/geobase.txt"])
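The "non-empty" part of such a check can be approximated with the standard library alone. The sketch below scans a directory tree for zero-byte TIF files. It is an illustration, not geocheck's implementation: the function name find_empty_tifs is hypothetical, and it does not model how geocheck decides which files are expected.

```python
import tempfile
from pathlib import Path


def find_empty_tifs(root):
    """Return sorted relative paths of zero-byte .tif files under `root`."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*.tif")
        if path.stat().st_size == 0
    )


# Demonstration with placeholder files in a year-specific subfolder
with tempfile.TemporaryDirectory() as tmp:
    year_dir = Path(tmp) / "cpc_tmax" / "2024"
    year_dir.mkdir(parents=True)
    (year_dir / "good.tif").write_bytes(b"\x00\x01")  # non-empty placeholder
    (year_dir / "bad.tif").write_bytes(b"")           # zero-byte file
    empty_files = find_empty_tifs(tmp)
```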

4. Extract crop masks and EO data (geoextract)

Extracts EO variable statistics (mean, median, etc.) for each admin region, crop, and growing season.

from geoprepare import geoextract
geoextract.run(cfg_geoprepare)
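As a rough picture of what "statistics per admin region" means, the sketch below groups pixel values by region ID and computes the mean and median for each group. It uses plain Python lists with made-up values. geoextract itself works on rasters, crop masks, and growing seasons, none of which are modelled here, and the function name region_statistics is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def region_statistics(values, region_ids):
    """Group pixel values by region id and compute mean and median per region."""
    grouped = defaultdict(list)
    for value, region in zip(values, region_ids):
        if value is None:
            continue  # treat None as a missing pixel
        grouped[region].append(value)
    return {
        region: {"mean": mean(pixels), "median": median(pixels)}
        for region, pixels in grouped.items()
    }


# Made-up NDVI-like values for two hypothetical regions
values = [0.2, 0.4, 0.9, None, 0.5, 0.7]
region_ids = ["A", "A", "A", "B", "B", "B"]
stats = region_statistics(values, region_ids)
```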

5. Merge extracted data (geomerge)

Merges per-region/year EO CSV files into a single CSV per country-crop-season combination.

from geoprepare import geomerge
geomerge.run(cfg_geoprepare)
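Conceptually, the merge step concatenates many small CSV files that share a header into one file. The standard-library sketch below shows that idea. It is not geomerge's implementation: the function name merge_csv_files, the filenames, and the column names are assumptions chosen for illustration.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_files(input_paths, output_path):
    """Concatenate CSV files that share a header into one CSV; return row count."""
    rows = []
    fieldnames = None
    for path in input_paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            elif reader.fieldnames != fieldnames:
                raise ValueError(f"Header mismatch in {path}")
            rows.extend(reader)
    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Hypothetical per-region/year files with made-up columns and values
    (tmp / "region1_2024.csv").write_text("region,year,ndvi\nregion1,2024,0.61\n")
    (tmp / "region2_2024.csv").write_text("region,year,ndvi\nregion2,2024,0.48\n")
    inputs = sorted(tmp.glob("region*.csv"))
    merged_path = tmp / "malawi_maize_s1.csv"
    n_rows = merge_csv_files(inputs, merged_path)
    merged_lines = merged_path.read_text().splitlines()
```

Raising on a header mismatch is a deliberate choice in this sketch: silently merging files with different columns would produce a CSV that looks valid but misaligns values.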

Config files

| File | Purpose | Used by |
|---|---|---|
| geobase.txt | Paths, dataset settings, boundary file column mappings, logging | both |
| countries.txt | Per-country config (boundary files, admin levels, seasons, crops) | both |
| crops.txt | Crop masks, calendar category settings (EWCM, AMIS) | both |
| geoextract.txt | Extraction-only settings (method, threshold, parallelism) | geoprepare |
| geocif.txt | Indices/ML/agmet settings, country overrides, runtime selections | geocif |

Order matters: Config files are loaded left-to-right. When the same key appears in multiple files, the last file wins. The tool-specific file (geoextract.txt or geocif.txt) must be last so its [DEFAULT] values (countries, method, etc.) override the shared defaults in countries.txt.

config_dir = "/path/to/config"  # full path to your config directory

cfg_geoprepare = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geoextract.txt"]
cfg_geocif = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geocif.txt"]
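The "last file wins" rule matches how Python's standard configparser behaves when it reads a list of files. The sketch below demonstrates this with two tiny temporary files. That geoprepare loads its configs through configparser is an assumption here, and the helper name load_config and the file contents are made up for illustration.

```python
import configparser
import tempfile
from pathlib import Path


def load_config(paths):
    """Read INI-style files left to right; later files override earlier ones."""
    parser = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation()
    )
    parser.read(paths)
    return parser


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp) / "countries.txt"
    tool = Path(tmp) / "geoextract.txt"
    shared.write_text(
        "[DEFAULT]\n"
        "admin_level = admin_1\n"
        "crops = ['maize']\n"
        "\n"
        "[kenya]\n"
        "admin_level = admin_2\n"
    )
    tool.write_text(
        "[DEFAULT]\n"
        "countries = ['malawi']\n"
        "crops = ['rice']\n"
    )

    # Tool-specific file last: its [DEFAULT] values override the shared ones
    cfg = load_config([shared, tool])
    kenya_crops = cfg["kenya"]["crops"]
    # A key set in the country section itself still beats any [DEFAULT]
    kenya_admin = cfg["kenya"]["admin_level"]

    # Reversed order: the shared [DEFAULT] now overrides the tool-specific one
    reversed_cfg = load_config([tool, shared])
    reversed_crops = reversed_cfg["kenya"]["crops"]
```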

Config file documentation

geobase.txt

Shared paths, dataset settings, boundary file column mappings, and logging. All directory paths are derived from dir_base.

[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
; Other available: 'CHIRPS-GEFS', 'AGERA5', 'FLDAS', 'LST', 'VHI', 'FPAR', 'SOIL-MOISTURE', 'AVHRR', 'VIIRS'

[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO

dir_inputs = ${dir_base}/inputs
dir_logs = ${dir_base}/logs
dir_download = ${dir_inputs}/download
dir_intermed = ${dir_inputs}/intermed
dir_metadata = ${dir_inputs}/metadata
dir_condition = ${dir_inputs}/crop_condition
dir_crop_inputs = ${dir_condition}/crop_t20

dir_boundary_files = ${dir_metadata}/boundary_files
dir_crop_calendars = ${dir_metadata}/crop_calendars
dir_crop_masks = ${dir_metadata}/crop_masks
dir_images = ${dir_metadata}/images
dir_production_statistics = ${dir_metadata}/production_statistics

dir_output = ${dir_base}/outputs

; --- Per-dataset settings ---

[AEF]
; AlphaEarth Foundations satellite embeddings (2018-2024, 64 channels, 10m)
; Source: https://source.coop/tge-labs/aef  |  License: CC-BY 4.0
; Countries are read from geoextract.txt [DEFAULT] countries
buffer = 0.5
download_vrt = True
start_year = 2018
end_year = 2024

[AGERA5]
variables = ['Precipitation_Flux', 'Temperature_Air_2m_Max_24h', 'Temperature_Air_2m_Min_24h']

[AVHRR]
data_dir = https://www.ncei.noaa.gov/data/avhrr-land-normalized-difference-vegetation-index/access

[CHIRPS]
fill_value = -2147483648
; CHIRPS version: 'v2' for CHIRPS-2.0 or 'v3' for CHIRPS-3.0
version = v3
; Disaggregation method for v3 only: 'sat' (IMERG) or 'rnl' (ERA5)
; - 'sat': Uses NASA IMERG Late V07 for daily downscaling (available from 1998, 0.1° resolution)
; - 'rnl': Uses ECMWF ERA5 for daily downscaling (full time coverage, 0.25° resolution)
; Note: Prelim data is only available with 'sat' due to ERA5 latency (5-6 days)
disagg = sat

[CHIRPS-GEFS]
fill_value = -2147483648
data_dir = /pub/org/chc/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/15day/precip_mean/

[CPC]
data_dir = ftp://ftp.cdc.noaa.gov/Datasets

[ESI]
data_dir = https://gis1.servirglobal.net//data//esi//
list_products = ['4wk', '12wk']

[FLDAS]
use_spear = False
data_types = ['forecast']
variables = ['SoilMoist_tavg', 'TotalPrecip_tavg', 'Tair_tavg', 'Evap_tavg', 'TWS_tavg']
leads = [0, 1, 2, 3, 4, 5]
compute_anomalies = False

[FPAR]
data_dir = https://agricultural-production-hotspots.ec.europa.eu//data//indicators_fpar//fpar//

[LST]
num_update_days = 7

[NDVI]
product = MOD09CMG
vi = ndvi
scale_glam = False
scale_mark = True
print_missing = False

[VIIRS]
product = VNP09CMG
vi = ndvi
scale_glam = False
scale_mark = True
print_missing = False

[NSIDC]

[SOIL-MOISTURE]
data_dir = https://gimms.gsfc.nasa.gov/SMOS/SMAP/L03/

[VHI]
data_historic = https://www.star.nesdis.noaa.gov/data/pub0018/VHPdata4users/VHP_4km_GeoTiff/
data_current = https://www.star.nesdis.noaa.gov/pub/corp/scsb/wguo/data/Blended_VH_4km/geo_TIFF/

; --- Boundary file column mappings ---
; Section name = filename stem (without extension)
; Maps source shapefile columns to standard internal names:
;   adm0_col  -> ADM0_NAME (country)
;   adm1_col  -> ADM1_NAME (admin level 1)
;   adm2_col  -> ADM2_NAME (admin level 2, optional)
;   id_col    -> ADM_ID    (unique feature ID)

[adm_shapefile]
adm0_col = ADMIN0
adm1_col = ADMIN1
adm2_col = ADMIN2
id_col = FNID

[gaul1_asap_v04]
adm0_col = name0
adm1_col = name1
id_col = asap1_id

[EWCM_Level_1]
adm0_col = ADM0_NAME
adm1_col = ADM1_NAME
id_col = num_ID

; Add more [boundary_stem] sections as needed for other shapefiles

[LOGGING]
level = ERROR

[POOCH]
; URL to download metadata.zip (boundary files, crop masks, calendars, etc.)
; NOTE: Set this to your own hosted URL (e.g. Dropbox, S3, etc.)
url = <your_metadata_zip_url>
enabled = True

[DEFAULT]
logfile = log
parallel_process = False
fraction_cpus = 0.35
start_year = 2001
end_year = 2026
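The ${dir_base} references in [PATHS] use the same syntax as ExtendedInterpolation in Python's standard configparser. The sketch below shows how such chained references resolve, assuming geoprepare parses the file this way. The dir_base value is a placeholder, not a path from this project.

```python
import configparser

# Trimmed-down [PATHS] section with a placeholder base directory
SAMPLE = """
[PATHS]
dir_base = /data/GEO
dir_inputs = ${dir_base}/inputs
dir_intermed = ${dir_inputs}/intermed
dir_output = ${dir_base}/outputs
"""

parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
parser.read_string(SAMPLE)

# References are resolved recursively when a value is read
dir_intermed = parser["PATHS"]["dir_intermed"]
dir_output = parser.get("PATHS", "dir_output")

# raw=True returns the value exactly as written in the file
raw_value = parser.get("PATHS", "dir_intermed", raw=True)
```

Because every path derives from dir_base, changing that one value relocates the whole directory tree.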

countries.txt

Single source of truth for per-country config. Each country owns its calendar_file, crops, eo_model, and other settings. Shared by both geoprepare and geocif.

[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
seasons = [1]
crops = ['maize']
category = AMIS
use_cropland_mask = False
calendar_file = crop_calendar.csv
mask = cropland_v9.tif
statistics_file = statistics.csv
zone_file = countries.csv
shp_region = GlobalCM_Regions_2025-11.shp
eo_model = ['aef', 'nsidc_surface', 'nsidc_rootzone', 'ndvi', 'cpc_tmax', 'cpc_tmin', 'chirps', 'chirps_gefs', 'esi_4wk']
annotate_regions = False

;;; AMIS countries (inherit from DEFAULT, override crops if needed) ;;;
[argentina]
crops = ['soybean', 'winter_wheat', 'maize']

[brazil]
crops = ['maize', 'soybean', 'winter_wheat', 'rice']

[india]
crops = ['rice', 'maize', 'winter_wheat', 'soybean']

[united_states_of_america]
crops = ['rice', 'maize', 'winter_wheat']

; ... (40+ AMIS countries, most inherit DEFAULT crops)

;;; EWCM countries (full per-country config) ;;;
[kenya]
category = EWCM
admin_level = admin_1
seasons = [1, 2]
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize']

[malawi]
category = EWCM
admin_level = admin_2
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize']

[ethiopia]
category = EWCM
admin_level = admin_2
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize', 'sorghum', 'millet', 'rice', 'winter_wheat', 'teff']

; ... (30+ EWCM countries, mostly Sub-Saharan Africa)

;;; Other countries (custom boundary files, non-standard setups) ;;;
[nepal]
crops = ['rice']
boundary_file = hermes_NPL_new_wgs_2.shp

[illinois]
admin_level = admin_3
boundary_file = illinois_counties.shp

crops.txt

Crop mask filenames and calendar category settings. Calendar categories define shared settings (cropland masking, boundary files, growing seasons) for groups of countries.

;;; Crop masks ;;;
[winter_wheat]
mask = Percent_Winter_Wheat.tif

[spring_wheat]
mask = Percent_Spring_Wheat.tif

[maize]
mask = Percent_Maize.tif

[soybean]
mask = Percent_Soybean.tif

[rice]
mask = Percent_Rice.tif

[teff]
mask = cropland_v9.tif

[sorghum]
mask = cropland_v9.tif

[millet]
mask = cropland_v9.tif

;;; Calendar categories ;;;
[EWCM]
use_cropland_mask = True
shp_boundary = adm_shapefile.gpkg
growing_seasons = [1]  ; 1 is primary/long season, 2 is secondary/short season

[AMIS]

geoextract.txt

Extraction-only settings for geoprepare. Loaded last so its [DEFAULT] overrides shared defaults.

[DEFAULT]
start_year = 2001
end_year = 2026
project_name = geocif
method = JRC
redo = False
threshold = True
floor = 20
ceil = 90
parallel_extract = True
parallel_merge = True
fraction_cpus = 0.6
countries = ["malawi"]
forecast_seasons = [2026]

geocif.txt

Indices, ML, and agmet settings for geocif. Country overrides go here when geocif needs different values than countries.txt (e.g., a subset of crops). Its [DEFAULT] section is loaded last and overrides shared defaults for geocif runs.

[AGMET]
eo_plot = ['ndvi', 'cpc_tmax', 'cpc_tmin', 'chirps', 'esi_4wk', 'nsidc_surface', 'nsidc_rootzone']
logo_harvest = harvest.png
logo_geoglam = geoglam.png

;;; Country overrides (only where geocif differs from countries.txt) ;;;
[bangladesh]
crops = ['rice']
admin_level = admin_2
boundary_file = bangladesh.shp
annotate_regions = False
input_file_path = ${PATHS:dir_output}/countries

[ethiopia]
crops = ['winter_wheat']

[india]
crops = ['soybean', 'maize', 'rice']

[russian_federation]
crops = ['winter_wheat', 'maize']

[somalia]
crops = ['maize']

[ukraine]
crops = ['winter_wheat', 'maize']

;;; ML model definitions ;;;
[linear]
ml_model = True

[gam]
ml_model = True

[analog]
ML_model = False

[median]
ML_model = False

[catboost]
ML_model = True

[desreg]
ML_model = True

[ngboost]
ML_model = True

[tabpfn]
ML_model = True

; ... (additional models: tabicl, cumulative_*, oblique, merf, cubist, ydf, etc.)

[ML]
model_type = REGRESSION
target = Yield (tn per ha)
feature_selection = multi
lag_years = 3
panel_model = True
panel_model_region = Country
median_years = 5
lag_yield_as_feature = True
run_latest_time_period = True
run_every_time_period = 3
cat_features = ["Harvest Year", "Region_ID", "Region"]
loocv_var = Harvest Year
check_yield_trend = True
detrend_method = gaussian

[LOGGING]
log_level = ERROR

[DEFAULT]
data_source = harvest
method = monthly_r
project_name = geocif
countries = ["malawi"]
crops = ['maize']
admin_level = admin_1
models = ['catboost']
seasons = [1]
threshold = True
floor = 20
fraction_cpus = 0.7
input_file_path = ${PATHS:dir_crop_inputs}/processed
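Values such as ${PATHS:dir_output} refer to a key in another section. With configparser's ExtendedInterpolation, the section:key form resolves as sketched below. The assumption is that geocif reads its config this way, and the dir_base value is a placeholder.

```python
import configparser

SAMPLE = """
[PATHS]
dir_base = /data/GEO
dir_output = ${dir_base}/outputs

[bangladesh]
input_file_path = ${PATHS:dir_output}/countries
"""

parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
parser.read_string(SAMPLE)

# The [bangladesh] value pulls dir_output from [PATHS],
# which in turn pulls dir_base from its own section
input_file_path = parser["bangladesh"]["input_file_path"]
```

This only works when the file that defines [PATHS] (geobase.txt) is part of the same load list as geocif.txt.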

Supported datasets

| Dataset | Description | Source |
|---|---|---|
| AEF | AlphaEarth Foundations satellite embeddings (64-band, 10m) | source.coop |
| AGERA5 | Agrometeorological indicators (precipitation, temperature) | CDS |
| AVHRR | Long-term NDVI | NOAA NCEI |
| CHIRPS | Rainfall estimates (v2 and v3) | CHC |
| CHIRPS-GEFS | 15-day precipitation forecasts | CHC |
| CPC | Temperature (Tmax, Tmin) and precipitation | NOAA CPC |
| ESI | Evaporative Stress Index (4-week, 12-week) | SERVIR |
| FLDAS | Land surface model outputs (soil moisture, precip, temp) | NASA |
| FPAR | Fraction of Absorbed Photosynthetically Active Radiation | JRC |
| LST | Land Surface Temperature (MODIS MOD11C1) | NASA |
| NDVI | Vegetation index from MODIS (MOD09CMG) | NASA |
| NSIDC | SMAP L4 soil moisture (surface, rootzone) | NASA NSIDC |
| SOIL-MOISTURE | NASA-USDA soil moisture (surface as1, subsurface as2) | NASA |
| VHI | Vegetation Health Index | NOAA STAR |
| VIIRS | Vegetation index from VIIRS (VNP09CMG) | NASA |

Directory layout

All datasets organize files into year-specific subfolders. After running geomove (or on fresh downloads), the directory structure looks like:

dir_download/
  nsidc/2025/*.h5, nsidc/2026/*.h5
  chirps_gefs/2026/*.tif
  fpar/2024/*.tif, fpar/2025/*.tif
  modis_lst/*.hdf                     (flat - pymodis manages this)
  ...

dir_intermed/
  cpc_tmax/2024/*.tif, cpc_tmax/2025/*.tif
  cpc_tmin/2024/*.tif, ...
  cpc_precip/2024/*.tif, ...
  chirps/v3/global/2024/*.tif, ...    (CHIRPS already used year subfolders)
  chirps_gefs/2026/*.tif
  esi_4wk/2024/*.tif, ...
  esi_12wk/2024/*.tif, ...
  ndvi/2024/*.tif, ...
  lst/2024/*.tif, ...
  nsidc/subdaily/2025/*.tif
  nsidc/daily/surface/2025/*.tif
  nsidc/daily/rootzone/2025/*.tif
  soil_moisture_as1/2024/*.tif, ...
  soil_moisture_as2/2024/*.tif, ...
  agera5/tif/{variable}/2024/*.tif, ...
  vhi/global/2024/*.tif, ...
  aef/{country}/2018/*.tif, ..., aef/{country}/aef_avg_global.tif
  fldas/.../2024/*.tif, ...           (FLDAS already used year subfolders)
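When working with this layout from your own scripts, two small helpers are often enough: one to build a year-specific path and one to list the years present for a dataset. The sketch below is a hypothetical convenience, not part of geoprepare. The function names year_dir and available_years are made up.

```python
import tempfile
from pathlib import Path


def year_dir(base, dataset, year):
    """Return the year-specific folder for a dataset, e.g. base/cpc_tmax/2024."""
    return Path(base) / dataset / str(year)


def available_years(base, dataset):
    """List the years that have a four-digit subfolder under base/dataset."""
    dataset_dir = Path(base) / dataset
    return sorted(
        int(child.name)
        for child in dataset_dir.iterdir()
        if child.is_dir() and child.name.isdigit() and len(child.name) == 4
    )


# Demonstration on a temporary directory standing in for dir_intermed
with tempfile.TemporaryDirectory() as tmp:
    for year in (2025, 2024):
        year_dir(tmp, "cpc_tmax", year).mkdir(parents=True)
    years = available_years(tmp, "cpc_tmax")
    relative = year_dir(tmp, "cpc_tmax", 2024).relative_to(tmp).as_posix()
```

For nested datasets, the dataset argument can carry the intermediate folders, for example "chirps/v3/global".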

Upload package to PyPI

Navigate to the root of the geoprepare repository (the directory containing pyproject.toml):

cd /path/to/geoprepare

Step 1: Update version

Use bump2version to update the version in both pyproject.toml and geoprepare/__init__.py:

Using uv:

uvx bump2version patch --current-version X.X.X --new-version X.X.Y pyproject.toml geoprepare/__init__.py

Using pip:

pip install bump2version
bump2version patch --current-version X.X.X --new-version X.X.Y pyproject.toml geoprepare/__init__.py

Or manually edit the version in pyproject.toml and geoprepare/__init__.py.

Step 2: Clean old builds

Linux/macOS:

rm -rf dist/ build/ *.egg-info/

Windows (Command Prompt):

rmdir /s /q dist build geoprepare.egg-info

Windows (PowerShell):

Remove-Item -Recurse -Force dist/, build/, *.egg-info/ -ErrorAction SilentlyContinue

Step 3: Build and upload

Using uv (Linux/macOS):

uv build
uvx twine check dist/*
uvx twine upload dist/geoprepare-X.X.X*

Using uv (Windows):

uv build
uvx twine check dist\geoprepare-X.X.X.tar.gz dist\geoprepare-X.X.X-py3-none-any.whl
uvx twine upload dist\geoprepare-X.X.X.tar.gz dist\geoprepare-X.X.X-py3-none-any.whl

Using pip:

pip install build twine
python -m build
twine check dist/*
twine upload dist/geoprepare-X.X.X*

In Step 1, replace X.X.X with the current version and X.X.Y with the new version. In the Step 3 commands, X.X.X stands for the version you just built.

Optional: Configure PyPI credentials

To avoid entering credentials each time, create a ~/.pypirc file (Linux/macOS) or %USERPROFILE%\.pypirc (Windows):

[pypi]
username = __token__
password = pypi-YOUR_API_TOKEN_HERE

Credits

This project was supported by NASA Applied Sciences Grant No. 80NSSC17K0625 through the NASA Harvest Consortium, and the NASA Acres Consortium under NASA Grant #80NSSC23M0034.