Commit d4a7e24e authored by Jonathan Minz

created CF-Compliance-Checker.md

parent 9bcd7de8
# CF Compliance Checker Script

## Overview

The `check_cf_compliance.py` script automates the validation of **NetCDF files** against the [CF (Climate and Forecast) Metadata Conventions](http://cfconventions.org/).  

It uses the [CF-Checker](https://github.com/cedadev/cf-checker) utility (`cfchecks`) and is designed for workflows where NetCDF datasets are organized in subdirectories. The script:

- Finds the **alphabetically first NetCDF file** (`*.nc`) in each subdirectory of a given parent directory.  
- Runs **CF-Checker** on those files.  
- Saves results into two separate logs:
  - `cf_compliance_details.log` → detailed CF-Checker output (per file)  
  - `cf_compliance_summary.log` → summary table of results (across all files)  
- Optionally exports the summary to **CSV** or **Excel**, with timestamped filenames for reproducibility.  

---

## Workflow

```mermaid
flowchart TD
    A[Parent Directory] --> B[Find first NetCDF file in each subdir]
    B --> C[Run cfchecks on each file]
    C --> D["Parse CF-Checker summary (Errors, Warnings, Info)"]
    D --> E["cf_compliance_details.log<br>(per-file details)"]
    D --> F["cf_compliance_summary.log<br>(summary table)"]
    F --> G["Optional CSV Export<br>(timestamped)"]
    F --> H["Optional Excel Export<br>(timestamped)"]
```

---

## Requirements

### Python packages
- Python ≥ 3.7  
- `pandas` (for CSV/Excel export; writing `.xlsx` files also needs an Excel engine such as `openpyxl`)  
- `tabulate` (for pretty summary tables)  

Install with conda:
```bash
conda install pandas tabulate
```

or with pip:
```bash
pip install pandas tabulate
```

### CF-Checker
The script depends on the `cfchecks` command-line tool from [CEDA’s CF-Checker](https://github.com/cedadev/cf-checker).

Install with conda (recommended):
```bash
conda install -c conda-forge cfchecker
```

or with pip:
```bash
pip install cfchecker
```

---

## Usage

```bash
python check_cf_compliance.py <parent_directory> [CF_version] [--csv] [--excel]
```

### Arguments
- `<parent_directory>` → Path to the parent folder containing subdirectories with NetCDF files.  
- `[CF_version]` → (Optional) CF version to check against, e.g. `1.8`.  
- `--csv` → Export summary results to a timestamped CSV file.  
- `--excel` → Export summary results to a timestamped Excel file.  
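
For illustration, the arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual parser: the function name `build_parser` and the attribute names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        prog="check_cf_compliance.py",
        description="Check the first NetCDF file in each subdirectory for CF compliance.",
    )
    parser.add_argument(
        "parent_directory",
        help="Parent folder containing subdirectories with NetCDF files",
    )
    # Optional positional argument: stays None when no CF version is given.
    parser.add_argument(
        "cf_version",
        nargs="?",
        default=None,
        help="CF version to check against, e.g. 1.8",
    )
    parser.add_argument("--csv", action="store_true",
                        help="Export the summary to a timestamped CSV file")
    parser.add_argument("--excel", action="store_true",
                        help="Export the summary to a timestamped Excel file")
    return parser


# Equivalent to: python check_cf_compliance.py /data/netcdf_files 1.8 --csv
args = build_parser().parse_args(["/data/netcdf_files", "1.8", "--csv"])
print(args.parent_directory, args.cf_version, args.csv, args.excel)
# /data/netcdf_files 1.8 True False
```

With this layout the CF version is kept as a string (`"1.8"`), which is convenient if it is later passed on to `cfchecks` unchanged.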

### Examples

Check all subdirectories and just log results:
```bash
python check_cf_compliance.py /data/netcdf_files
```

Check against CF-1.8:
```bash
python check_cf_compliance.py /data/netcdf_files 1.8
```

Check and export results to CSV:
```bash
python check_cf_compliance.py /data/netcdf_files 1.8 --csv
```

Check and export results to Excel:
```bash
python check_cf_compliance.py /data/netcdf_files --excel
```

Export both CSV and Excel:
```bash
python check_cf_compliance.py /data/netcdf_files 1.8 --csv --excel
```

---

## Outputs

1. **Log files**
   - `cf_compliance_details.log` → detailed `cfchecks` output for each file.  
   - `cf_compliance_summary.log` → summary table with per-file counts of Errors, Warnings, and Info, plus an overall PASS/FAIL result.  

2. **Optional exports**
   - `cf_compliance_summary_YYYYMMDD_HHMMSS.csv` → machine-readable CSV summary.  
   - `cf_compliance_summary_YYYYMMDD_HHMMSS.xlsx` → Excel summary.  
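
The `YYYYMMDD_HHMMSS` part of the export filenames can be produced with a `strftime`-style format. The helper below is a sketch only; `timestamped_name` is a hypothetical name, not necessarily what the script uses.

```python
from datetime import datetime
from typing import Optional


def timestamped_name(suffix: str, now: Optional[datetime] = None) -> str:
    """Build a name such as cf_compliance_summary_20240131_093000.csv."""
    now = now or datetime.now()  # default to the current local time
    return f"cf_compliance_summary_{now:%Y%m%d_%H%M%S}{suffix}"


# A fixed timestamp is used here so the result is reproducible.
print(timestamped_name(".csv", datetime(2024, 1, 31, 9, 30, 0)))
# cf_compliance_summary_20240131_093000.csv
```

A name built this way could then be passed to pandas' `DataFrame.to_csv` or `DataFrame.to_excel`, so that repeated runs never overwrite earlier exports.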

Example summary table in logs:

```
| File                  | Errors | Warnings | Info | Result |
|-----------------------|--------|----------|------|--------|
| huragl10S1_200001.nc  |      0 |        3 |    0 | PASS   |
| windagl80S1_199901.nc |      1 |        5 |    0 | FAIL   |
```
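
The rows of such a table could be assembled as sketched below. The PASS/FAIL rule used here (FAIL as soon as at least one error is reported, warnings alone still PASS) is an assumption inferred from the example table, and `result_for` and `summary_rows` are hypothetical names.

```python
from typing import Dict, List

COLUMNS = ["File", "Errors", "Warnings", "Info", "Result"]


def result_for(errors: int) -> str:
    """Assumed rule: any CF error means FAIL; warnings alone still PASS."""
    return "FAIL" if errors > 0 else "PASS"


def summary_rows(counts: Dict[str, Dict[str, int]]) -> List[list]:
    """Turn per-file counts into rows matching COLUMNS."""
    rows = []
    for name, c in counts.items():
        rows.append([name, c["Errors"], c["Warnings"], c["Info"],
                     result_for(c["Errors"])])
    return rows


rows = summary_rows({
    "huragl10S1_200001.nc": {"Errors": 0, "Warnings": 3, "Info": 0},
    "windagl80S1_199901.nc": {"Errors": 1, "Warnings": 5, "Info": 0},
})
for row in rows:
    print(row)
# ['huragl10S1_200001.nc', 0, 3, 0, 'PASS']
# ['windagl80S1_199901.nc', 1, 5, 0, 'FAIL']
```

Rows in this shape can be handed to `tabulate(rows, headers=COLUMNS, tablefmt="github")` to obtain a pipe-delimited table similar to the one above.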

---

## How It Works

1. **File selection**: For each subdirectory in the parent folder, the script takes the **alphabetically first NetCDF file** (`*.nc`).  
2. **Run CF-Checker**: Calls the `cfchecks` utility for each selected file.  
3. **Parse results**: Extracts the `ERRORS detected`, `WARNINGS given`, and `INFORMATION messages` lines from the CF-Checker output.  
4. **Logging**:
   - Writes all raw CF-Checker output into `cf_compliance_details.log`.  
   - Builds a concise summary table and writes it into `cf_compliance_summary.log`.  
5. **Export** (optional): Saves the summary table into CSV/Excel with timestamped filenames for reproducibility.  
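
Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function names are hypothetical, and passing the CF version through the `-v` option of `cfchecks` is an assumption about how the script forwards `[CF_version]`.

```python
import re
import subprocess
from pathlib import Path
from typing import Dict, List, Optional

# Summary lines printed at the end of a cfchecks report.
SUMMARY_PATTERNS = {
    "Errors": re.compile(r"ERRORS detected:\s*(\d+)"),
    "Warnings": re.compile(r"WARNINGS given:\s*(\d+)"),
    "Info": re.compile(r"INFORMATION messages:\s*(\d+)"),
}


def first_netcdf_per_subdir(parent: Path) -> List[Path]:
    """Step 1: alphabetically first *.nc file of each immediate subdirectory."""
    selected = []
    for subdir in sorted(p for p in parent.iterdir() if p.is_dir()):
        candidates = sorted(subdir.glob("*.nc"))
        if candidates:  # skip subdirectories without NetCDF files
            selected.append(candidates[0])
    return selected


def run_cfchecks(path: Path, cf_version: Optional[str] = None) -> str:
    """Step 2: run cfchecks on one file and return stdout and stderr combined."""
    cmd = ["cfchecks"]
    if cf_version:
        cmd += ["-v", cf_version]  # assumption: version forwarded via -v
    cmd.append(str(path))
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.stdout


def parse_summary(output: str) -> Dict[str, Optional[int]]:
    """Step 3: extract the three counts; None marks a missing summary line."""
    counts: Dict[str, Optional[int]] = {}
    for label, pattern in SUMMARY_PATTERNS.items():
        match = pattern.search(output)
        counts[label] = int(match.group(1)) if match else None
    return counts


# Parsing demonstrated on a hand-written excerpt of a cfchecks report.
sample = "ERRORS detected: 1\nWARNINGS given: 5\nINFORMATION messages: 0\n"
print(parse_summary(sample))
# {'Errors': 1, 'Warnings': 5, 'Info': 0}
```

Returning `None` for a missing summary line keeps a crashed or truncated `cfchecks` run distinguishable from a file that genuinely has zero errors.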

---

## Roadmap / Extensions

Future improvements could include:
- Adding `--details-only` or `--summary-only` flags to control which logs are written.  
- Processing **all files** in each subdirectory, not just the first.  
- Supporting parallel execution for large datasets.
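
As a rough idea of what parallel execution might look like, the sketch below maps a per-file check over a thread pool. Threads are enough here because the real work would happen in external `cfchecks` processes. The names `check_in_parallel` and `fake_check` are hypothetical, and `fake_check` merely stands in for a function that would invoke `cfchecks`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def check_in_parallel(
    files: Iterable[str],
    check: Callable[[str], str],
    max_workers: int = 4,
) -> List[Tuple[str, str]]:
    """Apply `check` to every file concurrently, preserving input order."""
    files = list(files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        outputs = list(pool.map(check, files))
    return list(zip(files, outputs))


def fake_check(path: str) -> str:
    """Stand-in for a function that would run cfchecks on one file."""
    return f"checked {path}"


print(check_in_parallel(["a.nc", "b.nc"], fake_check))
# [('a.nc', 'checked a.nc'), ('b.nc', 'checked b.nc')]
```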