
API Documentation for flyfield

This documentation provides a detailed reference to the flyfield Python API for programmatically working with PDF forms that use white box placeholders.

Overview

The flyfield API automates workflows including:

  • Extracting white box placeholders from vector PDFs
  • Filtering, deduplicating, and grouping detected regions into logical fields
  • Generating interactive AcroForm fields in PDFs programmatically
  • Filling form fields with data from CSV files
  • Capturing data back from filled PDFs into CSV

The API is modular and can be imported into Python projects, offering programmable control beyond the CLI.


Key Modules and Functions

1. Extraction (extract.py)

  • extract_boxes(pdf_path: str) -> List[dict] Extracts all white boxes from a PDF that match config.TARGET_COLOUR (pure white by default).
    • Converts coordinates to the standard bottom-left PDF system.
    • Returns a list of box dictionaries with metadata such as page_num, x0/y0/x1/y1, rounded left/bottom/right/top, chars, and field_type.
  • filter_boxes(page: fitz.Page, boxes: List[dict]) -> List[dict] Filters raw boxes, retaining only candidate placeholders, by:
    • Height (MIN_BOX_HEIGHT, MAX_BOX_HEIGHT)
    • Allowed text (utils.allowed_text)
  • remove_duplicates(boxes: List[dict]) -> List[dict] Removes duplicates based on rounded coordinates on each page.
  • sort_boxes(boxes: List[dict], decimal_places: int=0) -> List[dict] Sorts results by page, then top-to-bottom, then left-to-right; decimal_places controls the rounding of the bottom coordinate used to group rows.
  • process_boxes(pdf_path: str, csv_path: str) -> Dict[int, List[dict]] Full extraction pipeline:
    • Extract → Filter → Deduplicate → Sort
    • Compute layout fields (calculate_layout_fields)
    • Assign numeric block types (assign_numeric_blocks)
    • Save annotated results to CSV
    • Returns a dictionary keyed by page_num.

2. Layout (layout.py)

  • calculate_layout_fields(boxes: List[dict]) -> Dict[int, List[dict]] Annotates box rows with:
    • IDs, line numbers, block grouping
    • Block length/width
    • Concatenated block_fill text or formatted money values
  • assign_numeric_blocks(page_dict: Dict[int, List[dict]]) -> Dict[int, List[dict]] Merges sequential numeric blocks (e.g. ### ### ## patterns) into currency fields.
    • Assigns "Currency" or "CurrencyDecimal" where applicable.

3. CSV I/O (io_utils.py)

  • load_boxes_from_csv(csv_path: str) -> Dict[int, List[dict]] Reads CSV data into a page dictionary for further processing.
  • write_csv(data, csv_path: str) -> None Writes box/page data back to CSV in canonical format.
    • Ensures only one fill column is stored (block_fill or fallback fill).
  • read_csv_rows(filename: str) -> List[dict] Reads CSV into dictionaries, parsing numeric fills with parse_money_space or parse_implied_decimal.
  • save_pdf_form_data_to_csv(pdf_path: str, csv_path: str, boxes: dict=None) -> None Captures filled AcroForm values from a PDF and writes them to CSV.
    • Applies NUMERIC_FIELD_TYPES parsing rules.
    • Uppercases strings where applicable.

4. Markup and Field Scripts (markup_and_fields.py)

  • markup_pdf(pdf_path: str, page_dict: Dict[int,List[dict]], output_pdf: str, mark_color=(0,0,1)) -> None Creates a debug PDF marking detected fields with circles and rotated field codes.
  • generate_form_fields_script(csv_path: str, input_pdf: str, output_pdf: str, script_path: str) -> str Generates a standalone Python script that adds AcroForm fields to a given PDF, based on detected CSV data.
  • run_standalone_script(script_path: str) -> None Executes the generated script in a subprocess to apply fields.
  • run_fill_pdf_fields(csv_path: str, output_pdf: str, template_pdf: str, generator_script: str, boxes: dict=None) -> None Generates and runs a filler script that populates an interactive PDF with values from a CSV.
    • Supports monetary formatting via format_money_space.
    • Supports normalization of Currency/CurrencyDecimal values by stripping non-digits.

5. Utilities (utils.py)

  • add_suffix_to_filename(filename: str, suffix: str) -> str Adds a suffix before the file extension.
  • colour_match(color: Tuple, target_color=(1,1,1), tol=1e-3) -> bool Compares normalized RGB colors with tolerance.
  • int_to_rgb(color_int: int) -> Tuple[float,float,float] Converts an integer 0xRRGGBB color to normalized floats.
  • clean_fill_string(line_text: str) -> str Removes single spaces but preserves aligned spacing.
  • allowed_text(text: str, field_type: Optional[str]) -> Tuple[bool, Optional[str]] Checks whether a string value inside a field is allowed (filters out pre-printed text).
  • format_money_space(amount: Union[float,int], decimal=True) -> str Formats numeric values with:
    • Space as thousand separator
    • Space as decimal marker (if decimal=True)
  • parse_money_space(s: str, decimal=True) -> Union[int,float] Parses strings formatted above back into numbers.
  • parse_implied_decimal(s: str) -> float Parses numbers treating the last two digits as cents.
  • parse_pages(pages_str: str) -> List[int] Parses "1,3-5,7" into [1,3,4,5,7].
  • conditional_merge_list(main_list, ref_list, match_key, keys_to_merge) Merges keys from a reference list into a main list when values of match_key match.
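
The space-separated money format listed above can be illustrated with a small stand-alone sketch. The `sketch_` functions below are hypothetical re-implementations written for this page, not flyfield's own code. They assume non-negative amounts and that, when decimal=True, the final two-digit group always holds the cents; the library's real helpers may treat edge cases differently.

```python
def sketch_format_money_space(amount, decimal=True):
    """Format an amount using spaces as thousands and decimal separators."""
    if decimal:
        # Work in whole cents to avoid float artefacts such as 0.05 * 100.
        whole, cents = divmod(int(round(amount * 100)), 100)
    else:
        whole, cents = int(round(amount)), None
    grouped = f"{whole:,}".replace(",", " ")
    return f"{grouped} {cents:02d}" if decimal else grouped


def sketch_parse_money_space(text, decimal=True):
    """Parse a space-formatted amount back into a number."""
    digits = "".join(ch for ch in text if ch.isdigit())
    if not digits:
        raise ValueError(f"no digits in {text!r}")
    # Assumption: with decimal=True the last two digits are cents.
    return int(digits) / 100 if decimal else int(digits)


def sketch_parse_implied_decimal(text):
    """Treat the last two digits of an unseparated number as cents."""
    return sketch_parse_money_space(text, decimal=True)


formatted = sketch_format_money_space(1234.5)
whole_only = sketch_format_money_space(1234567, decimal=False)
round_trip = sketch_parse_money_space(formatted)
implied = sketch_parse_implied_decimal("123456")
```

Under these assumptions, 1234.5 is rendered as "1 234 50", and the unseparated string "123456" is read as 1234.56.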

Field Data Structure

flyfield represents form fields as dictionaries (not classes):

Key             Type     Description
code            str      Unique identifier (page-line-block naming scheme)
page_num        int      PDF page number (1-based)
x0, y0, x1, y1  float    Bounding box coordinates (PDF bottom-left system)
left, right     float    Rounded left/right coordinates
top, bottom     float    Rounded top/bottom coordinates
line            int      Line number on page
block           int      Block number within line
block_length    int      Number of boxes in block
block_width     float    Width of block in points
field_type      str      One of "Dollars", "DollarCents", "Currency", etc.
chars           str      Non-black overlay text extracted
fill            str/num  Black overlay text (user values, may be pre-filled)
block_fill      str/num  Aggregated/normalized block fill
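
As a rough illustration of the structure above, a single field might look like the dictionary below. Only the key names come from the table; every value, including the "1-2-1" rendering of the page-line-block code, is invented for this sketch. The height and width lines mirror how write_csv derives those columns from the bounding box.

```python
# Hypothetical field dictionary: key names follow the table above,
# all values are made up for illustration.
example_field = {
    "code": "1-2-1",  # assumed rendering of the page-line-block scheme
    "page_num": 1,
    "x0": 72.0, "y0": 700.0, "x1": 86.4, "y1": 714.4,
    "left": 72.0, "right": 86.4,
    "top": 714.4, "bottom": 700.0,
    "line": 2,
    "block": 1,
    "block_length": 1,
    "block_width": 14.4,
    "field_type": "Dollars",
    "chars": "",
    "fill": "",
    "block_fill": "",
}

# Height and width are not stored; they are derived from the bounding box
# and rounded to one decimal place.
height = round(example_field["y1"] - example_field["y0"], 1)
width = round(example_field["x1"] - example_field["x0"], 1)
```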

Example Usage

from flyfield.extract import process_boxes
from flyfield.io_utils import save_pdf_form_data_to_csv
from flyfield.markup_and_fields import markup_pdf, run_fill_pdf_fields

# Process boxes and save CSV
page_dict = process_boxes("example.pdf", "example.csv")

# Generate a markup PDF
markup_pdf("example.pdf", page_dict, "example-markup.pdf")

# Fill fields with values from a CSV
# (argument order: csv_path, output_pdf, template_pdf, generator_script, boxes)
run_fill_pdf_fields("example.csv",
                    "example-filled.pdf",
                    "example-fields.pdf",
                    "example-filler.py",
                    page_dict)

# Capture back to CSV after filling
save_pdf_form_data_to_csv("example-filled.pdf", "example-capture.csv", page_dict)

Info

  • flyfield depends on PyMuPDF (fitz) for box extraction and markup, and PyPDFForm for form field creation and filling.
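  • Monetary/Currency parsing is opinionated: amounts use a space as the thousands separator and, for decimal field types, as the decimal marker (see format_money_space).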
  • All generated scripts (-field-generator.py, -filler.py) are standalone and reusable in case of workflow adjustments.
  • Debug logging (--debug) outputs stepwise CSVs for troubleshooting (suffixes -extracted, -grouped, -filtered, and -layout).

Further Resources


Automatic documentation from sources by mkdocstrings.

Core Modules

flyfield.extract

Extraction functions for PDF processing.

Provides methods to extract PDF box data and text.

extract_boxes(pdf_path)

Extract filled rectangles (boxes) from a PDF matching a target color.

Parameters:

Name      Type  Description                  Default
pdf_path  str   Path to the input PDF file.  required

Returns:

Type        Description
List[Dict]  Each dict details box coordinates (PDF coordinates, origin bottom-left), page number, and other metadata for detected boxes.

Notes

Converts PyMuPDF coordinates (origin top-left) to PDF standard bottom-left origin. Only boxes filled with the target color are extracted.

Source code in flyfield/extract.py
def extract_boxes(pdf_path: str) -> List[Dict]:
    """
    Extract filled rectangles (boxes) from a PDF matching a target color.

    Args:
        pdf_path (str): Path to the input PDF file.

    Returns:
        list of dict: Each dict details box coordinates (PDF coordinates, origin bottom-left),
        page number, and other metadata for detected boxes.

    Notes:
        Converts PyMuPDF coordinates (origin top-left) to PDF standard bottom-left origin.
        Only boxes filled with the target color are extracted.
    """
    boxes = []
    try:
        with fitz.open(pdf_path) as doc:
            for page_num in range(1, len(doc) + 1):
                try:
                    page = doc[page_num - 1]
                except IndexError:
                    logger.warning(f"Page {page_num} not found in document.")
                    continue
                page_height = page.rect.height
                for drawing in page.get_drawings():
                    rect = drawing.get("rect")
                    fill_color = drawing.get("fill")
                    if rect and colour_match(fill_color, target_color=TARGET_COLOUR):
                        # Convert PyMuPDF page coordinates (origin top-left)
                        # to PDF coordinate system (origin bottom-left)

                        pdf_y0 = page_height - rect.y1
                        pdf_y1 = page_height - rect.y0
                        boxes.append(
                            {
                                "page_num": page_num,
                                "x0": rect.x0,
                                "y0": pdf_y0,
                                "x1": rect.x1,
                                "y1": pdf_y1,
                                "left": round(rect.x0, 2),
                                "bottom": round(pdf_y0, 2),
                                "right": round(rect.x1, 2),
                                "top": round(pdf_y1, 2),
                                "chars": "",
                                "field_type": None,
                            }
                        )
    except Exception as e:
        logger.error(f"Could not open PDF file {pdf_path}: {e}")
    return boxes
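
The coordinate conversion in the listing above can be tried in isolation. This is a minimal sketch with a hypothetical helper name (`flip_y`) and a made-up page height of 792 points; it uses plain tuples instead of PyMuPDF rectangles so that it runs without any dependency.

```python
def flip_y(rect, page_height):
    """Flip an (x0, y0, x1, y1) rectangle between top-left and bottom-left origins.

    The y values swap roles so that y0 stays the smaller coordinate.
    Applying the flip twice returns the original rectangle.
    """
    x0, y0, x1, y1 = rect
    return (x0, page_height - y1, x1, page_height - y0)


page_height = 792.0                       # assumed page height in points
top_left_rect = (72.0, 100.0, 90.0, 114.0)  # as reported with a top-left origin

pdf_rect = flip_y(top_left_rect, page_height)
back_again = flip_y(pdf_rect, page_height)
```

Because the flip is its own inverse, the same arithmetic serves both directions: extract_boxes uses it to store PDF coordinates, and filter_boxes uses it to rebuild a clip rectangle for PyMuPDF.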

filter_boxes(page, boxes)

Filter a list of boxes on a PDF page based on height and allowed text.

Parameters:

Name   Type          Description                          Default
page   Page          PyMuPDF page object.                 required
boxes  list of dict  List of box dictionaries to filter.  required

Returns:

Type        Description
List[Dict]  Filtered boxes that meet size and allowed text criteria.

Notes

Excludes boxes outside valid height ranges or with disallowed text.

Source code in flyfield/extract.py
def filter_boxes(page: fitz.Page, boxes: List[Dict]) -> List[Dict]:
    """
    Filter a list of boxes on a PDF page based on height and allowed text.

    Args:
        page (fitz.Page): PyMuPDF page object.
        boxes (list of dict): List of box dictionaries to filter.

    Returns:
        list of dict: Filtered boxes that meet size and allowed text criteria.

    Notes:
        Excludes boxes outside valid height ranges or with disallowed text.
    """
    filtered = []
    page_height = page.rect.height
    black = (0, 0, 0)  # RGB for black text matching

    for box in boxes:
        height = box.get("y1", 0) - box.get("y0", 0)
        if height < MIN_BOX_HEIGHT or height > MAX_BOX_HEIGHT:
            continue
        # Convert box coordinates to PyMuPDF's coordinate system for clipping

        pymupdf_y0 = page_height - box["y1"]
        pymupdf_y1 = page_height - box["y0"]
        clip_rect = fitz.Rect(box["x0"], pymupdf_y0, box["x1"], pymupdf_y1)

        text_dict = page.get_text("dict", clip=clip_rect)

        black_text_parts = []
        non_black_text_parts = []

        for block in text_dict.get("blocks", []):
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    span_text = span.get("text", "").strip()
                    if not span_text:
                        continue
                    span_color = span.get("color")
                    rgb = None
                    if span_color is not None:
                        if isinstance(span_color, int):
                            rgb = int_to_rgb(span_color)
                        elif isinstance(span_color, str):
                            try:
                                rgb = fitz.utils.getColor(span_color)
                            except Exception:
                                rgb = None
                    if rgb and colour_match(rgb, target_color=black):
                        black_text_parts.append(span_text)
                    else:
                        non_black_text_parts.append(span_text)
        fill_text = "".join(black_text_parts)
        box_text = "".join(non_black_text_parts)

        allowed, detected_field_type = allowed_text(
            box_text, field_type=box.get("field_type")
        )
        if box_text and not allowed:
            continue
        box["field_type"] = detected_field_type
        box["chars"] = box_text
        box["fill"] = fill_text
        filtered.append(box)
    return filtered
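
The colour test that separates black text (stored as fill) from non-black text (stored as chars) can be sketched without PyMuPDF. The `sketch_` helpers below are hypothetical stand-ins for utils.int_to_rgb and utils.colour_match, written from their one-line descriptions rather than from the library source, and the span dictionaries are made-up samples that only mimic the "text" and "color" keys read by the listing above.

```python
def sketch_int_to_rgb(color_int):
    """Split a 0xRRGGBB integer into normalized (r, g, b) floats."""
    r = (color_int >> 16) & 0xFF
    g = (color_int >> 8) & 0xFF
    b = color_int & 0xFF
    return (r / 255.0, g / 255.0, b / 255.0)


def sketch_colour_match(color, target_color=(1, 1, 1), tol=1e-3):
    """Return True when every channel is within tol of the target."""
    if color is None or len(color) != len(target_color):
        return False
    return all(abs(c - t) <= tol for c, t in zip(color, target_color))


def split_span_text(spans):
    """Return (fill_text, chars_text): black spans first, all others second."""
    black = (0, 0, 0)
    fill_parts, chars_parts = [], []
    for span in spans:
        text = span.get("text", "").strip()
        if not text:
            continue  # whitespace-only spans are ignored
        color = span.get("color")
        rgb = sketch_int_to_rgb(color) if isinstance(color, int) else None
        if rgb and sketch_colour_match(rgb, target_color=black):
            fill_parts.append(text)
        else:
            chars_parts.append(text)
    return "".join(fill_parts), "".join(chars_parts)


sample_spans = [
    {"text": "12", "color": 0x000000},   # black: treated as fill
    {"text": " $ ", "color": 0x0000FF},  # blue: treated as chars
    {"text": "  ", "color": 0x000000},   # empty after strip: skipped
]
fill_text, chars_text = split_span_text(sample_spans)
```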

remove_duplicates(boxes)

Remove duplicate boxes on the same page based on rounded coordinates.

Parameters:

Name   Type          Description                Default
boxes  list of dict  List of box dictionaries.  required

Returns:

Type        Description
List[Dict]  Boxes with duplicates removed.

Source code in flyfield/extract.py
def remove_duplicates(boxes: List[Dict]) -> List[Dict]:
    """
    Remove duplicate boxes on the same page based on rounded coordinates.

    Args:
        boxes (list of dict): List of box dictionaries.

    Returns:
        list of dict: Boxes with duplicates removed.
    """
    page_groups = defaultdict(list)
    for box in boxes:
        page_groups[box["page_num"]].append(box)
    cleaned = []
    for _page_num, page_boxes in page_groups.items():
        seen = set()
        for box in page_boxes:
            key = (
                round(box["x0"], 3),
                round(box["y0"], 3),
                round(box["x1"], 3),
                round(box["y1"], 3),
            )
            if key not in seen:
                seen.add(key)
                cleaned.append(box)
    return cleaned
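
To see what counts as a duplicate, the rounding key can be exercised on a few sample boxes. The function below is a compact sketch with a hypothetical name, not the library function: it applies the same three-decimal rounding per page but keeps the input order instead of grouping the output by page. The "name" key and all coordinates are invented for the example.

```python
def dedupe_by_rounded_coords(boxes, ndigits=3):
    """Keep the first box for each (page, rounded bounding box) key."""
    seen = set()
    unique = []
    for box in boxes:
        key = (box["page_num"],) + tuple(
            round(box[k], ndigits) for k in ("x0", "y0", "x1", "y1")
        )
        if key not in seen:
            seen.add(key)
            unique.append(box)
    return unique


sample = [
    {"name": "a", "page_num": 1, "x0": 72.0, "y0": 700.0, "x1": 86.4, "y1": 714.4},
    # Differs from "a" only in the fourth decimal: treated as a duplicate.
    {"name": "b", "page_num": 1, "x0": 72.0002, "y0": 700.0, "x1": 86.4, "y1": 714.4},
    # Same coordinates as "a" but on another page: kept.
    {"name": "c", "page_num": 2, "x0": 72.0, "y0": 700.0, "x1": 86.4, "y1": 714.4},
    # Differs in the third decimal: kept.
    {"name": "d", "page_num": 1, "x0": 72.002, "y0": 700.0, "x1": 86.4, "y1": 714.4},
]
kept = [box["name"] for box in dedupe_by_rounded_coords(sample)]
```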

sort_boxes(boxes, decimal_places=0)

Sort boxes by page number, top-to-bottom (descending), then left-to-right.

Parameters:

Name            Type          Description                                                    Default
boxes           list of dict  List of boxes to sort.                                         required
decimal_places  int           Precision for vertical grouping (bottom coordinate rounding).  0

Returns:

Type        Description
List[Dict]  Sorted boxes.

Source code in flyfield/extract.py
def sort_boxes(boxes: List[Dict], decimal_places: int = 0) -> List[Dict]:
    """
    Sort boxes by page number, top-to-bottom (descending), then left-to-right.

    Args:
        boxes (list of dict): List of boxes to sort.
        decimal_places (int): Precision for vertical grouping (bottom coordinate rounding).

    Returns:
        list of dict: Sorted boxes.
    """
    return sorted(
        boxes,
        key=lambda b: (b["page_num"], -round(b["bottom"], decimal_places), b["left"]),
    )
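
The effect of decimal_places is easiest to see with a negative value, which is what process_boxes passes (decimal_places=-1). The sketch below reuses the sort key from the listing under a hypothetical function name; the sample boxes and their "name" key are invented. With -1, bottom coordinates are rounded to the nearest ten points, so boxes whose baselines differ by a few points fall into the same row and are then ordered left-to-right.

```python
def sort_boxes_sketch(boxes, decimal_places=0):
    """Sort by page, then descending rounded bottom, then left."""
    return sorted(
        boxes,
        key=lambda b: (b["page_num"], -round(b["bottom"], decimal_places), b["left"]),
    )


sample = [
    {"name": "A", "page_num": 1, "bottom": 702.4, "left": 200.0},
    {"name": "B", "page_num": 1, "bottom": 698.9, "left": 72.0},
    {"name": "C", "page_num": 1, "bottom": 651.2, "left": 72.0},
]

# Rounding to whole points keeps A and B on separate rows.
exact = [b["name"] for b in sort_boxes_sketch(sample, decimal_places=0)]

# Rounding to tens puts A and B on one row, so the leftmost (B) comes first.
banded = [b["name"] for b in sort_boxes_sketch(sample, decimal_places=-1)]
```

Note that Python's round uses round-half-to-even, so a bottom value that sits exactly on a half step (for example 705.0 with decimal_places=-1) may land in either neighbouring band.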

process_boxes(pdf_path, csv_path)

Full pipeline to extract, filter, deduplicate, sort, layout annotate, and save boxes from a PDF.

Parameters:

Name      Type  Description                                      Default
pdf_path  str   Path to input PDF file.                          required
csv_path  str   Path to output CSV file for annotated box data.  required

Returns:

Name  Type                   Description
dict  Dict[int, List[Dict]]  Dictionary keyed by page number containing processed boxes with layout metadata.

Notes
  • Extract filled white boxes matching TARGET_COLOUR.
  • Filter boxes by valid height and allowed text content.
  • Remove duplicate boxes by coordinate proximity.
  • Sort boxes by page, vertical then horizontal order.
  • Compute layout fields such as IDs, block grouping, lines.
  • Assign numeric block field types using heuristics.
  • Write the full annotated box data to CSV.
Source code in flyfield/extract.py
def process_boxes(pdf_path: str, csv_path: str) -> Dict[int, List[Dict]]:
    """
    Full pipeline to extract, filter, deduplicate, sort, layout annotate, and save boxes from a PDF.

    Args:
        pdf_path (str): Path to input PDF file.
        csv_path (str): Path to output CSV file for annotated box data.

    Returns:
        dict: Dictionary keyed by page number containing processed boxes with layout metadata.

    Notes:
        - Extract filled white boxes matching TARGET_COLOUR.
        - Filter boxes by valid height and allowed text content.
        - Remove duplicate boxes by coordinate proximity.
        - Sort boxes by page, vertical then horizontal order.
        - Compute layout fields such as IDs, block grouping, lines.
        - Assign numeric block field types using heuristics.
        - Write the full annotated box data to CSV.
    """
    logger.info(f"Extracting boxes from PDF: {pdf_path}")
    boxes = extract_boxes(pdf_path)
    logger.info(f"Extracted {len(boxes)} white boxes.")

    try:
        doc = fitz.open(pdf_path)
    except Exception as e:
        logger.error(f"Error opening input PDF: {e}")
        return defaultdict(list)
    if logger.isEnabledFor(logging.DEBUG):
        write_csv(boxes, csv_path.replace(".csv", "-extracted.csv"))
    filtered_boxes = []
    for page_num in range(1, len(doc) + 1):
        page_boxes = [p for p in boxes if p["page_num"] == page_num]
        filtered_boxes.extend(filter_boxes(doc[page_num - 1], page_boxes))
    doc.close()

    if logger.isEnabledFor(logging.DEBUG):
        write_csv(filtered_boxes, csv_path.replace(".csv", "-grouped.csv"))
    filtered_boxes = remove_duplicates(filtered_boxes)
    filtered_boxes = sort_boxes(filtered_boxes, decimal_places=-1)

    if logger.isEnabledFor(logging.DEBUG):
        write_csv(filtered_boxes, csv_path.replace(".csv", "-filtered.csv"))
    page_dict = calculate_layout_fields(filtered_boxes)

    if logger.isEnabledFor(logging.DEBUG):
        write_csv(filtered_boxes, csv_path.replace(".csv", "-layout.csv"))
    page_dict = assign_numeric_blocks(page_dict)

    write_csv(page_dict, csv_path)
    return page_dict
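
When debug logging is enabled, the listing above derives the intermediate file names from csv_path with str.replace. The sketch below, using a hypothetical helper name, lists the names that result for a given path. It also shows a consequence of that approach: a csv_path that does not contain ".csv" is returned unchanged, so every debug stage would target the same file.

```python
def debug_csv_names(csv_path):
    """Return the stepwise debug CSV names in the order they are written."""
    stages = ("extracted", "grouped", "filtered", "layout")
    return [csv_path.replace(".csv", f"-{stage}.csv") for stage in stages]


names = debug_csv_names("example.csv")
no_extension_match = debug_csv_names("out.txt")
```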

flyfield.io_utils

Utility functions for input/output operations.

Includes CSV reading/writing and data transformation helpers.

load_boxes_from_csv(csv_path)

Load boxes data from a CSV into a dictionary keyed by page number, applying the specified types to each column in the CSV.

Parameters:

Name      Type  Description            Default
csv_path  str   Path to the CSV file.  required

Returns:

Type                   Description
dict[int, list[dict]]  Dictionary mapping page number (int) to a list of box dictionaries with appropriately typed values.

Description

The CSV is expected to contain columns:

  • page_num (int), id (int), x0 (float), y0 (float), x1 (float), y1 (float),
  • left (float), top (float), right (float), bottom (float),
  • height (float), width (float), pgap (float), gap (float),
  • line (int), block (int), block_length (int), block_width (float),
  • code (str), field_type (str), chars (str), fill (str)

Each value from the CSV is converted from string to the appropriate type. Empty or missing values are converted to None for numeric types and empty string for strings. Conversion errors are caught and logged; original strings are kept in those cases.

Source code in flyfield/io_utils.py
def load_boxes_from_csv(csv_path: str) -> dict[int, list[dict]]:
    """
    Load boxes data from a CSV into a dictionary keyed by page number,

    applying the specified types to each column in the CSV.

    Args:
        csv_path (str): Path to the CSV file.

    Returns:
        dict[int, list[dict]]: Dictionary mapping page number (int)
        to a list of box dictionaries with appropriately typed values.

    Description:
        The CSV is expected to contain columns:

        - page_num (int), id (int), x0 (float), y0 (float), x1 (float), y1 (float),
        - left (float), top (float), right (float), bottom (float),
        - height (float), width (float), pgap (float), gap (float),
        - line (int), block (int), block_length (int), block_width (float),
        - code (str), field_type (str), chars (str), fill (str)

        Each value from the CSV is converted from string to the appropriate type.
        Empty or missing values are converted to None for numeric types and empty string for strings.
        Conversion errors are caught and logged; original strings are kept in those cases.
    """
    logger.info(f"Reading blocks from CSV: {csv_path}")
    rows = read_csv_rows(csv_path)  # Should return list of dict[str, str]

    def convert_value(value: str, to_type):
        if value is None or value == "":
            if to_type in (int, float):
                return None
            return ""
        try:
            return to_type(value)
        except Exception as e:
            logger.warning(f"Failed to convert value '{value}' to {to_type}: {e}")
            return value  # fallback: keep original string

    page_dict = defaultdict(list)
    for row in rows:
        typed_row = {
            col: convert_value(row.get(col), col_type)
            for col, col_type in COLUMN_TYPES.items()
        }
        if typed_row.get("page_num") is not None:
            page_dict[typed_row["page_num"]].append(typed_row)
    return page_dict
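
The conversion rules can be checked on in-memory rows without touching a file. The sketch below follows the same logic as the listing, with two stated differences: `SKETCH_COLUMN_TYPES` is a hypothetical four-column subset standing in for COLUMN_TYPES, and conversion failures are returned silently instead of being logged. The row values are invented.

```python
from collections import defaultdict

# Hypothetical subset of the column-to-type mapping.
SKETCH_COLUMN_TYPES = {"page_num": int, "x0": float, "line": int, "code": str}


def convert_value(value, to_type):
    """Convert a CSV string; empty numerics become None, empty strings stay ''."""
    if value is None or value == "":
        return None if to_type in (int, float) else ""
    try:
        return to_type(value)
    except (TypeError, ValueError):
        return value  # fallback: keep the original string


def group_rows_by_page(rows):
    """Type each row and group it under its page number."""
    page_dict = defaultdict(list)
    for row in rows:
        typed = {
            col: convert_value(row.get(col), col_type)
            for col, col_type in SKETCH_COLUMN_TYPES.items()
        }
        if typed["page_num"] is not None:
            page_dict[typed["page_num"]].append(typed)
    return page_dict


rows = [
    {"page_num": "1", "x0": "72.5", "line": "", "code": "1-1-1"},
    {"page_num": "", "x0": "10", "line": "2", "code": "orphan"},  # no page: dropped
    {"page_num": "2", "x0": "n/a", "line": "3"},                  # bad float, no code
]
pages = group_rows_by_page(rows)
```

Rows without a page number are discarded, while a value that fails conversion (here "n/a") survives as its original string, so downstream code may still need to guard against mixed types.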

write_csv(boxes_or_page_dict, csv_path)

Write box data or page dictionary data to CSV file.

Saves only one 'fill' column:

  • Uses 'block_fill' if present,
  • Otherwise falls back to original 'fill'.

Parameters:

Name                Type          Description                                                        Default
boxes_or_page_dict  list or dict  List of box dicts or dict keyed by page containing lists of boxes.  required
csv_path            str           Output CSV file path.                                              required
Source code in flyfield/io_utils.py
def write_csv(
    boxes_or_page_dict: Union[List[Dict], Dict[int, List[Dict]]], csv_path: str
) -> None:
    """
    Write box data or page dictionary data to CSV file.

    Saves only one 'fill' column:
        - Uses 'block_fill' if present,
        - Otherwise falls back to original 'fill'.

    Args:
        boxes_or_page_dict (list or dict): List of box dicts or dict keyed by page containing lists of boxes.
        csv_path (str): Output CSV file path.
    """
    if isinstance(boxes_or_page_dict, dict):
        all_boxes = [
            box
            for boxes in boxes_or_page_dict.values()
            if boxes is not None
            for box in boxes
        ]
    else:
        all_boxes = boxes_or_page_dict or []
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(CSV_HEADER)
            for box in all_boxes:
                height = round(box.get("y1", 0) - box.get("y0", 0), 1)
                width = round(box.get("x1", 0) - box.get("x0", 0), 1)
                fill_value = box.get("block_fill")
                if fill_value is None:
                    fill_value = box.get("fill", "")
                field_type = box.get("field_type")
                # Convert monetary fill values back to float/int as appropriate

                if (
                    field_type in ("Dollars", "DollarCents", "CurrencyDecimal")
                    and fill_value
                ):
                    decimal = field_type in ("DollarCents", "CurrencyDecimal")
                    try:
                        fill_value = parse_money_space(fill_value, decimal=decimal)
                    except Exception as e:
                        logger.warning(
                            f"Failed to parse money from fill_value '{fill_value}' for field_type '{field_type}': {e}"
                        )
                row = [
                    box.get("page_num", ""),
                    box.get("id", ""),
                    box.get("x0", ""),
                    box.get("y0", ""),
                    box.get("x1", ""),
                    box.get("y1", ""),
                    box.get("left", ""),
                    box.get("top", ""),
                    box.get("right", ""),
                    box.get("bottom", ""),
                    height,
                    width,
                    box.get("pgap", ""),
                    box.get("gap", ""),
                    box.get("line", ""),
                    box.get("block", ""),
                    box.get("block_length", ""),
                    box.get("block_width", ""),
                    box.get("code", ""),
                    box.get("field_type", ""),
                    box.get("chars", ""),
                    fill_value,
                ]

                writer.writerow(row)
    except Exception as e:
        logger.error(f"Failed to write CSV {csv_path}: {e}")
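
The choice of the single fill column can be isolated into a few lines. The helper name below is hypothetical, but the logic is the one used in the listing: the fallback to fill happens only when block_fill is missing or None, so an empty-string block_fill still takes precedence.

```python
def choose_fill(box):
    """Pick the value written to the CSV 'fill' column for one box."""
    fill_value = box.get("block_fill")
    if fill_value is None:
        fill_value = box.get("fill", "")
    return fill_value


with_block = choose_fill({"block_fill": "1 234 50", "fill": "1"})
fill_only = choose_fill({"fill": "AB"})
empty_block = choose_fill({"block_fill": "", "fill": "AB"})
neither = choose_fill({})
```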

read_csv_rows(filename)

Read CSV rows into a list, converting typed fields and normalizing monetary fills.

Parameters:

Name      Type  Description        Default
filename  str   Path to CSV file.  required

Returns:

Type                  Description
List[Dict[str, str]]  Rows with typed values and block_fill normalized.

Source code in flyfield/io_utils.py
def read_csv_rows(filename: str) -> List[Dict[str, str]]:
    """
    Read CSV rows into a list, converting typed fields and normalizing monetary fills.

    Args:
        filename (str): Path to CSV file.

    Returns:
        list of dict: Rows with typed values and block_fill normalized.
    """
    rows = []
    currency_field_types = {"Dollars", "DollarCents", "Currency", "CurrencyDecimal"}

    try:
        with open(filename, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            headers = reader.fieldnames or []
            is_extraction_csv = "page_num" in headers

            for row in reader:
                if is_extraction_csv:
                    try:
                        # Convert page_num, line, gap, block_length, height, width fields to correct types

                        row["page_num"] = (
                            int(row["page_num"]) if row["page_num"].strip() else None
                        )
                        row["line"] = int(row["line"]) if row["line"].strip() else None
                        row["gap"] = float(row["gap"]) if row["gap"].strip() else 0.0
                        row["block_length"] = (
                            int(row["block_length"])
                            if row["block_length"].strip()
                            else 0
                        )
                        row["height"] = float(row.get("height", 0))
                        row["width"] = float(row.get("width", 0))
                    except (ValueError, KeyError) as e:
                        logger.warning(f"Skipping row due to value error: {e}")
                        continue
                # Rearrange 'fill' to 'block_fill' with formatted monetary fields

                if "fill" in row:
                    fill_value = row["fill"]
                    field_type = row.get("field_type", "")

                    if field_type in currency_field_types and fill_value.strip():
                        if (
                            field_type in ("DollarCents", "CurrencyDecimal")
                            and " " not in fill_value
                        ):
                            # Use implied decimal parser for no explicit decimal separator

                            try:
                                amount = parse_implied_decimal(fill_value)
                                fill_value = format_money_space(amount, decimal=True)
                            except Exception as e:
                                logger.warning(
                                    f"Failed to parse implied decimal fill '{fill_value}' for field_type '{field_type}': {e}"
                                )
                        else:
                            # Use existing parser for explicit decimal formatting

                            decimal = field_type in ("DollarCents", "CurrencyDecimal")
                            try:
                                amount = parse_money_space(fill_value, decimal=decimal)
                                fill_value = format_money_space(amount, decimal=decimal)
                            except Exception as e:
                                logger.warning(
                                    f"Failed to parse/format fill '{fill_value}' for field_type '{field_type}': {e}"
                                )
                        row["block_fill"] = fill_value
                        del row["fill"]
                rows.append(row)
    except Exception as e:
        logger.error(f"Failed to read CSV rows from {filename}: {e}")
    return rows
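
The branch that decides how a fill value is parsed can be summarised as a small decision function. This is a sketch with a hypothetical name that returns the name of the parser the listing above would call, rather than calling it, so it runs without flyfield installed.

```python
CURRENCY_FIELD_TYPES = {"Dollars", "DollarCents", "Currency", "CurrencyDecimal"}
DECIMAL_FIELD_TYPES = ("DollarCents", "CurrencyDecimal")


def choose_parser(field_type, fill_value):
    """Return which parser applies to a fill value, or None if it is left as-is."""
    if field_type not in CURRENCY_FIELD_TYPES or not fill_value.strip():
        return None
    if field_type in DECIMAL_FIELD_TYPES and " " not in fill_value:
        # No space separator: the last two digits are taken as cents.
        return "parse_implied_decimal"
    return "parse_money_space"


unseparated_cents = choose_parser("DollarCents", "123456")
separated_cents = choose_parser("DollarCents", "1 234 56")
whole_dollars = choose_parser("Dollars", "1234")
not_currency = choose_parser("Text", "abc")
blank = choose_parser("Currency", "   ")
```

In the listing, only rows that take one of the two parser branches have their fill column moved to block_fill; all other rows keep fill unchanged.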

save_pdf_form_data_to_csv(pdf_path, csv_path, boxes=None)

Extract PDF form data, convert string values to uppercase and numeric fields to raw numbers, then save as CSV.

Parameters:

Name      Type  Description                          Default
pdf_path  str   Input PDF form file path.            required
csv_path  str   Output CSV path.                     required
boxes     dict  Boxes metadata to enrich form data.  None

Returns:

Type  Description
None  None

Source code in flyfield/io_utils.py
def save_pdf_form_data_to_csv(
    pdf_path: str, csv_path: str, boxes: Optional[Dict[int, List[dict]]] = None
) -> None:
    """
    Extract PDF form data, convert string values to uppercase and numeric fields to raw numbers, then save as CSV.

    Args:
        pdf_path (str): Input PDF form file path.
        csv_path (str): Output CSV path.
        boxes (dict, optional): Boxes metadata to enrich form data.

    Returns:
        None
    """
    data = []
    try:
        # Extract form data; convert string values to uppercase where applicable

        form_data = {
            k: v.upper() if isinstance(v, str) else str(v)
            for k, v in PdfWrapper(pdf_path).data.items()
            if v is not None and str(v).strip() != "" and str(v).strip("0") != ""
        }
        # Convert raw data dict to list of dicts with explicit 'code' and 'value' keys

        data = [{"code": k, "value": v} for k, v in form_data.items()]
    except Exception as e:
        logger.error(f"Failed to extract data from {pdf_path}: {e}")
    logger.debug(f"Extracted PDF form data (type={type(data)}), count={len(data)}")

    if boxes:
        flat_boxes = [entry for sublist in boxes.values() for entry in sublist]
        conditional_merge_list(data, flat_boxes, "code", ["field_type"])
    try:
        with open(csv_path, mode="w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            # Write CSV header

            writer.writerow(["code", "fill"])

            # Write each field record as a CSV row

            for field in data:
                code = field.get("code")
                fill_value = field.get("value")
                field_type = field.get("field_type")

                logger.debug(
                    f"Code: {code}, Raw Value: {fill_value}, Field Type: {field_type}"
                )
                if field_type in NUMERIC_FIELD_TYPES and isinstance(fill_value, str):
                    try:
                        if field_type == "CurrencyDecimal":
                            amount = parse_implied_decimal(fill_value)
                            fill_value = str(amount)  # Save raw number, not formatted!
                            logger.debug(f"Parsed CurrencyDecimal: {fill_value}")
                        elif field_type in ("DollarCents", "Dollars"):
                            decimal = field_type == "DollarCents"
                            amount = parse_money_space(fill_value, decimal=decimal)
                            fill_value = str(amount)  # Save raw number, not formatted!
                            logger.debug(f"Parsed {field_type}: {fill_value}")
                    except Exception as e:
                        logger.warning(
                            f"Failed parsing money value '{fill_value}' for field '{code}': {e}"
                        )
                writer.writerow([code, fill_value])
    except Exception as e:
        logger.error(f"Failed to write CSV file {csv_path}: {e}")
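
The value filter at the top of this function is easy to misread, so here is a minimal sketch of it applied to a plain dictionary instead of a PDF. The helper name `keep_form_values` and the sample codes are hypothetical; only the filtering and upper-casing logic is taken from the source above.

```python
from typing import Any, Dict


def keep_form_values(raw: Dict[str, Any]) -> Dict[str, str]:
    """Apply the same filter and conversion as the comprehension above (hypothetical helper)."""
    return {
        key: value.upper() if isinstance(value, str) else str(value)
        for key, value in raw.items()
        # Drop None, blank strings, and values made up only of "0" characters.
        if value is not None
        and str(value).strip() != ""
        and str(value).strip("0") != ""
    }


# Hypothetical sample data standing in for PdfWrapper(pdf_path).data
sample = {"1-1-1": "smith", "1-2-1": "", "1-3-1": "000", "1-4-1": 42, "1-5-1": None}
cleaned = keep_form_values(sample)
print(cleaned)
```

Under these assumptions only `1-1-1` and `1-4-1` survive: the empty string, the all-zero string, and `None` are dropped before anything is written to the CSV.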

flyfield.layout

Layout processing for PDFs.

Calculates layout box positions and formatting.

calculate_layout_fields(boxes)

Annotate boxes with layout metadata: IDs, line numbers, blocks, block dimensions, monetary formatting, and concatenated fill text per block.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| boxes | list | List of boxes sorted by page and vertical order. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | DefaultDict[int, List[Dict]] | Mapping page numbers to lists of annotated boxes. |

Notes
  • Vertical tolerance epsilon controls grouping boxes into the same line.
  • Blocks are formed by grouping boxes separated by large gaps (GAP_THRESHOLD).
  • Monetary fills are formatted with spaces and decimals where appropriate.
Source code in flyfield/layout.py
def calculate_layout_fields(boxes: List[Dict]) -> DefaultDict[int, List[Dict]]:
    """
    Annotate boxes with layout metadata: IDs, line numbers, blocks, block
    dimensions, monetary formatting, and concatenated fill text per block.

    Args:
        boxes (list): List of boxes sorted by page and vertical order.

    Returns:
        dict: Mapping page numbers to lists of annotated boxes.

    Notes:
        - Vertical tolerance epsilon controls grouping boxes into the same line.
        - Blocks are formed by grouping boxes separated by large gaps (GAP_THRESHOLD).
        - Monetary fills are formatted with spaces and decimals where appropriate.
    """
    epsilon = 1  # Vertical tolerance for grouping boxes into the same line
    idx = 0
    current_page = None
    line_counter = 1
    while idx < len(boxes):
        page_num = boxes[idx]["page_num"]
        if page_num != current_page:
            current_page = page_num
            line_counter = 1
        block_id_counter = 1
        # Initialize first box in a new line and block

        boxes[idx].update(
            {
                "id": idx + 1,
                "line": line_counter,
                "block_start": block_id_counter,
                "block": block_id_counter,
                "code": f"{page_num}-{line_counter}-{block_id_counter}",
                "pgap": None,  # Gap before this box (none for first)
            }
        )
        block_start = idx
        j = idx + 1
        # Group boxes horizontally on the same line by bottom alignment and gap thresholds

        while (
            j < len(boxes)
            and boxes[j]["page_num"] == page_num
            and abs(boxes[j]["bottom"] - boxes[idx]["bottom"]) < epsilon
        ):
            boxes[j]["id"] = j + 1
            boxes[j]["line"] = line_counter
            prev_gap = round(boxes[j]["x0"] - boxes[j - 1]["x1"], 1)
            boxes[j]["pgap"] = prev_gap
            boxes[j - 1]["gap"] = prev_gap
            if prev_gap >= GAP_THRESHOLD:
                # Close current block and start a new block

                end_idx = j - 1
                block_length = (end_idx - block_start) + 1
                block_width = round(boxes[end_idx]["x1"] - boxes[block_start]["x0"], 1)
                boxes[block_start]["block_length"] = block_length
                boxes[block_start]["block_width"] = block_width
                current_box = boxes[block_start]
                if current_box.get("field_type") not in ("DollarCents", "Dollars"):
                    raw_fill = " ".join(
                        box.get("fill", "") for box in boxes[block_start : end_idx + 1]
                    )
                    boxes[block_start]["block_fill"] = clean_fill_string(raw_fill)
                else:
                    decimal = current_box.get("field_type") == "DollarCents"
                    fill_val = current_box.get("fill", "")
                    try:
                        if fill_val == "" or fill_val is None:
                            fill_val = 0
                        current_box["fill"] = format_money_space(fill_val, decimal)
                    except Exception as e:
                        logger.warning(f"Failed to format fill value '{fill_val}': {e}")
                        # fall back to original fill value if formatting fails

                        current_box["fill"] = fill_val
                block_id_counter += 1
                block_start = j
                boxes[j].update(
                    {
                        "block_start": block_id_counter,
                        "block": block_id_counter,
                        "code": f"{page_num}-{line_counter}-{block_id_counter}",
                    }
                )
            else:
                # Continue current block

                boxes[j].update(
                    {
                        "block_start": block_id_counter,
                        "block": block_id_counter,
                        "code": f"{page_num}-{line_counter}-{block_id_counter}",
                    }
                )
            j += 1
        # Close last block on the line

        end_idx = j - 1
        block_length = (end_idx - block_start) + 1
        block_width = round(boxes[end_idx]["x1"] - boxes[block_start]["x0"], 1)
        boxes[block_start]["block_length"] = block_length
        boxes[block_start]["block_width"] = block_width
        current_box = boxes[block_start]
        if current_box.get("field_type") not in ("DollarCents", "Dollars"):
            raw_fill = " ".join(
                box.get("fill", "") for box in boxes[block_start : end_idx + 1]
            )
            boxes[block_start]["block_fill"] = clean_fill_string(raw_fill)
        else:
            decimal = current_box.get("field_type") == "DollarCents"
            fill_val = current_box.get("fill", "")
            try:
                if fill_val == "" or fill_val is None:
                    fill_val = 0
                current_box["fill"] = format_money_space(fill_val, decimal)
            except Exception as e:
                logger.warning(f"Failed to format fill value '{fill_val}': {e}")
                current_box["fill"] = fill_val
        boxes[end_idx]["gap"] = None  # No gap after the last box in the line
        line_counter += 1
        idx = j
    block_id_counter = 1
    # Group boxes by page number, only include blocks with length >= 1, then sort by line and left coordinate

    page_dict = defaultdict(list)
    for box in boxes:
        if box.get("block_length", 0) >= 1:
            page_dict[box["page_num"]].append(box)
    for page_num in page_dict:
        page_dict[page_num].sort(key=lambda r: (r.get("line", 0), r.get("left", 0)))
    return page_dict
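
The line and block grouping is the core of this function. The sketch below reduces it to a standalone helper, leaving out IDs, codes, and fill handling. `group_into_blocks` is a hypothetical name, and the `GAP_THRESHOLD` value is an assumption chosen for illustration; the real threshold is defined elsewhere in flyfield.

```python
from typing import Dict, List

EPSILON = 1          # vertical tolerance, as in the source above
GAP_THRESHOLD = 3.0  # hypothetical value; the real threshold is defined in flyfield


def group_into_blocks(boxes: List[Dict]) -> List[List[Dict]]:
    """Group boxes (already sorted top-to-bottom, left-to-right) into blocks."""
    blocks: List[List[Dict]] = []
    line_anchor = None  # first box of the line currently being scanned
    for box in boxes:
        same_line = (
            line_anchor is not None
            and box["page_num"] == line_anchor["page_num"]
            and abs(box["bottom"] - line_anchor["bottom"]) < EPSILON
        )
        if same_line and round(box["x0"] - blocks[-1][-1]["x1"], 1) < GAP_THRESHOLD:
            blocks[-1].append(box)  # small gap: continue the current block
        else:
            blocks.append([box])  # new line or large gap: start a new block
            if not same_line:
                line_anchor = box
    return blocks


boxes = [
    {"page_num": 1, "bottom": 700.0, "x0": 10.0, "x1": 20.0},
    {"page_num": 1, "bottom": 700.2, "x0": 21.0, "x1": 31.0},
    {"page_num": 1, "bottom": 700.0, "x0": 40.0, "x1": 50.0},
    {"page_num": 1, "bottom": 680.0, "x0": 10.0, "x1": 20.0},
]
block_sizes = [len(block) for block in group_into_blocks(boxes)]
print(block_sizes)
```

With the assumed threshold, the first two boxes form one block, the third starts a new block on the same line because of the 9-point gap, and the fourth starts a new line.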

assign_numeric_blocks(page_dict)

Merge and assign numeric block types based on heuristics of adjacency and length.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| page_dict | dict | Keyed by page number with boxes list. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | DefaultDict[int, List[Dict]] | Updated page_dict with numeric block types assigned. |

Notes

Modifies the page_dict in place:

  • Merges runs of adjacent blocks of length 3 if gaps between them are small.
  • Optionally prepends certain preceding blocks to runs.
  • Assigns field types "CurrencyDecimal" or "Currency" based on heuristics.
  • Aggregates block lengths, widths, and concatenates fill strings.
Source code in flyfield/layout.py
def assign_numeric_blocks(
    page_dict: DefaultDict[int, List[Dict]],
) -> DefaultDict[int, List[Dict]]:
    """
    Merge and assign numeric block types based on heuristics of adjacency and length.

    Args:
        page_dict (dict): Keyed by page number with boxes list.

    Returns:
        dict: Updated page_dict with numeric block types assigned.

    Notes:
        Modifies the page_dict in place:

        - Merges runs of adjacent blocks of length 3 if gaps between them are small.
        - Optionally prepends certain preceding blocks to runs.
        - Assigns field types "CurrencyDecimal" or "Currency" based on heuristics.
        - Aggregates block lengths, widths, and concatenates fill strings.
    """
    for page_num, rows in page_dict.items():
        rows.sort(key=lambda r: (r.get("line", 0), r.get("left", 0)))
        page_dict[page_num] = rows
        i = 0
        while i < len(rows):
            block_length = rows[i].get("block_length", 0)
            if block_length == 3:
                run = [rows[i]]
                j = i + 1
                # Collect consecutive blocks of length 3 separated by small gaps

                while j < len(rows):
                    next_block_length = rows[j].get("block_length", 0)
                    next_pgap = rows[j].get("pgap")
                    if (
                        next_block_length == 3
                        and next_pgap is not None
                        and 0 < next_pgap < 8
                    ):
                        run.append(rows[j])
                        j += 1
                    else:
                        break
                # Optionally prepend preceding block if conditions met

                if len(run) >= 2 and i > 0:
                    prev = rows[i - 1]
                    first_pgap = rows[i].get("pgap", 0)
                    if (
                        prev.get("block_length") in (1, 2)
                        and first_pgap is not None
                        and 1 <= first_pgap < 8
                    ):
                        run.insert(0, prev)
                        i -= 1
                next_idx = j
                next_block_length = (
                    rows[next_idx].get("block_length") if next_idx < len(rows) else None
                )
                next_gap = rows[next_idx].get("pgap") if next_idx < len(rows) else None
                if len(run) >= 2:
                    if (
                        next_idx < len(rows)
                        and next_block_length == 2
                        and next_gap is not None
                    ):
                        run.append(rows[next_idx])
                        run[0]["field_type"] = "CurrencyDecimal"
                        j += 1
                    else:
                        run[0]["field_type"] = "Currency"
                    # Aggregate block length and width for the merged block

                    block_length_sum = sum(
                        r.get("block_length", 0) for r in run if r.get("block_length")
                    )
                    run[0]["block_length"] = block_length_sum
                    first_left = min(r.get("left", float("inf")) for r in run)
                    last_left = max(r.get("left", float("-inf")) for r in run)
                    run[0]["block_width"] = (
                        last_left - first_left + run[-1]["block_width"]
                    )
                    fills = [
                        r.get("block_fill", "") for r in run if r.get("block_fill")
                    ]
                    run[0]["block_fill"] = "".join(fills).strip()
                    # Clear subordinate blocks lengths and fills

                    for r in run[1:]:
                        r["block_length"] = None
                        r["block_width"] = None
                        r["block_fill"] = None
                    i = j
                else:
                    i += 1
            else:
                i += 1
    return page_dict
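
To make the run heuristic concrete, the following simplified sketch classifies a single run from a list of `(block_length, gap_before)` pairs. It is not the library's code: `classify_run` is a hypothetical helper, and it deliberately omits the optional step that prepends a preceding block of length 1 or 2. The length-3 rule, the `0 < gap < 8` test, and the trailing length-2 block follow the source above.

```python
from typing import List, Optional, Tuple


def classify_run(blocks: List[Tuple[int, Optional[float]]]) -> Tuple[Optional[str], int]:
    """Return (field_type, merged_length) for a run starting at blocks[0]."""
    if not blocks or blocks[0][0] != 3:
        return None, 0
    j = 1
    # Collect consecutive 3-box blocks separated by small gaps.
    while j < len(blocks):
        length, gap = blocks[j]
        if length == 3 and gap is not None and 0 < gap < 8:
            j += 1
        else:
            break
    if j < 2:
        return None, 0  # a single 3-box block is not treated as a run
    total = 3 * j
    # A trailing 2-box block marks the cents part of the amount.
    if j < len(blocks) and blocks[j][0] == 2 and blocks[j][1] is not None:
        return "CurrencyDecimal", total + 2
    return "Currency", total


with_cents = classify_run([(3, None), (3, 5.0), (3, 5.0), (2, 6.0)])
whole_only = classify_run([(3, None), (3, 5.0)])
too_far = classify_run([(3, None), (3, 12.0)])
print(with_cents, whole_only, too_far)
```

The merged length corresponds to the aggregated `block_length` that the real function stores on the first block of the run.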

flyfield.markup_and_fields

Functions for PDF markup and form field annotation.

markup_pdf(pdf_path, page_dict, output_pdf_path, mark_color=(0, 0, 1), mark_radius=1)

Mark PDF with circles and codes at block locations for debugging.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pdf_path | str | Input PDF file. | required |
| page_dict | dict | Pages and boxes with layout info. | required |
| output_pdf_path | str | Output marked PDF file path. | required |
| mark_color | tuple | RGB float tuple for marker color. | (0, 0, 1) |
| mark_radius | int or float | Radius of circle marks. | 1 |

Returns:

| Type | Description |
|---|---|
| None | None |

Source code in flyfield/markup_and_fields.py
def markup_pdf(
    pdf_path: str,
    page_dict: Dict[int, List[Dict]],
    output_pdf_path: str,
    mark_color: Tuple[float, float, float] = (0, 0, 1),
    mark_radius: float = 1,
) -> None:
    """
    Mark PDF with circles and codes at block locations for debugging.

    Args:
        pdf_path (str): Input PDF file.
        page_dict (dict): Pages and boxes with layout info.
        output_pdf_path (str): Output marked PDF file path.
        mark_color (tuple): RGB float tuple for marker color.
        mark_radius (int or float): Radius of circle marks.

    Returns:
        None
    """
    try:
        doc = fitz.open(pdf_path)
    except Exception as e:
        logger.error(f"Failed to open PDF for markup: {e}")
        return
    for page_num, boxes in sorted(page_dict.items()):
        if config.PDF_PAGES and page_num not in config.PDF_PAGES:
            continue
        page = doc[page_num - 1]
        page_height = page.rect.height
        shape = page.new_shape()

        for box in boxes:
            # Only mark boxes that have a meaningful block_length

            if box.get("block_length") not in ("", 0, None):
                x, y_raw = box.get("x0"), box.get("y0")
                y = page_height - y_raw
                shape.draw_circle((x, y), mark_radius)

                point = fitz.Point(x + 4, y)
                shape.insert_text(
                    point,
                    str(box.get("code", "?")),
                    fontsize=8,
                    color=mark_color,
                    morph=(point, fitz.Matrix(1, 0, 0, 1, 0, 0).prerotate(45)),
                )
        shape.finish(color=mark_color, fill=None)
        shape.commit()
    try:
        doc.save(output_pdf_path)
    except Exception as e:
        logger.error(f"Failed to save output PDF: {e}")
    finally:
        doc.close()
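
The only coordinate arithmetic in this function is the vertical flip `y = page_height - y_raw`. A small worked example, assuming the box's `y0` is stored in bottom-left PDF coordinates and the page height is 842 points (an illustrative value, not taken from the source):

```python
def flip_y(y_raw: float, page_height: float) -> float:
    """Mirror a y coordinate between bottom-left and top-left origins (hypothetical helper)."""
    return page_height - y_raw


page_height = 842.0  # assumed page height in points
marker_y = flip_y(100.0, page_height)
print(marker_y)
```

The flip is its own inverse, so applying it twice returns the original value.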

adjust_form_boxes(row, width, block_length)

Adjust the position and width of form boxes depending on field type and block length.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| row | dict | Box attributes. | required |
| width | float | Original block width. | required |
| block_length | int | Block length in contained boxes. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| tuple | Tuple[float, float, List[str]] | (adjusted x, adjusted width, list of extra args) |

Source code in flyfield/markup_and_fields.py
def adjust_form_boxes(
    row: Dict,
    width: float,
    block_length: int,
) -> Tuple[float, float, List[str]]:
    """
    Adjust the position and width of form boxes depending on field type and block length.

    Args:
        row (dict): Box attributes.
        width (float): Original block width.
        block_length (int): Block length in contained boxes.

    Returns:
        tuple: (adjusted x, adjusted width, list of extra args)
    """
    x = float(row["left"])
    field_type = row.get("field_type")
    extra_args = ["alignment=2"]

    if (
        block_length == 1
        and width > 14
        and field_type not in ("Currency", "CurrencyDecimal")
    ):
        # Reduce width by size of layout characters

        width_adjusted = width
        if field_type == "Dollars":
            width_adjusted -= 21
        elif field_type == "DollarCents":
            width_adjusted -= 4
        return x, max(0, width_adjusted), extra_args
    if field_type in ("Currency", "CurrencyDecimal"):
        gap_adj = (2 * GAP + GAP_GROUP) / 3 / 2
        gap_start = (gap_adj * (((block_length - 1) % 3) + 1)) / 2 + F
        if field_type == "CurrencyDecimal":
            gap_start += F * 2
        gap_end = gap_adj + F * 2 if field_type == "Currency" else (gap_adj * 3) / 2
    else:
        gap_adj = GAP
        gap_start = gap_end = gap_adj / 2 + F
        extra_args[0] = "alignment=0"
    x -= gap_start
    width_adjusted = width + gap_start + gap_end
    extra_args += [
        f"max_length={block_length}" if block_length else "max_length=None",
        "comb=True",
    ]
    return x, max(0, width_adjusted), extra_args
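
As a worked example of the padding arithmetic, the sketch below reproduces only the final `else` branch, which handles a multi-box block that is neither "Currency" nor "CurrencyDecimal". The values of `GAP` and `F` are assumptions for illustration, since the real constants are defined elsewhere in flyfield, and `pad_plain_block` is a hypothetical name.

```python
from typing import List, Tuple

GAP = 2.0  # hypothetical value; the real constant is defined in flyfield
F = 0.5    # hypothetical value; the real constant is defined in flyfield


def pad_plain_block(x: float, width: float, block_length: int) -> Tuple[float, float, List[str]]:
    """Pad a non-currency comb field by the same amount on both sides."""
    gap = GAP / 2 + F  # gap_start and gap_end are equal in this branch
    args = [
        "alignment=0",
        f"max_length={block_length}" if block_length else "max_length=None",
        "comb=True",
    ]
    return x - gap, max(0, width + 2 * gap), args


adjusted = pad_plain_block(100.0, 50.0, 5)
print(adjusted)
```

With these assumed constants the field starts 1.5 points further left and ends 1.5 points further right than the detected block.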

generate_form_fields_script(csv_path, input_pdf, output_pdf_with_fields, script_path)

Generate a standalone Python script to create PDF form fields from CSV block data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| csv_path | str | CSV data path. | required |
| input_pdf | str | Input PDF to annotate. | required |
| output_pdf_with_fields | str | Output annotated PDF. | required |
| script_path | str | Output script file path. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Path to the generated script file. |

Source code in flyfield/markup_and_fields.py
def generate_form_fields_script(
    csv_path: str,
    input_pdf: str,
    output_pdf_with_fields: str,
    script_path: str,
) -> str:
    """
    Generate a standalone Python script to create PDF form fields from CSV block data.

    Args:
        csv_path (str): CSV data path.
        input_pdf (str): Input PDF to annotate.
        output_pdf_with_fields (str): Output annotated PDF.
        script_path (str): Output script file path.

    Returns:
        str: Path to the generated script file.
    """
    lines = [
        "from PyPDFForm import Fields, PdfWrapper",
        f'pdf = PdfWrapper("{input_pdf}")',
    ]
    try:
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            current_page = None
            for row in reader:
                page_number = int(row["page_num"])

                # Skip rows whose page number is not in PDF_PAGES if PDF_PAGES filter is set

                if config.PDF_PAGES and page_number not in config.PDF_PAGES:
                    continue
                code = row["code"]
                if (
                    not code
                    or row["block_length"] in ("", "0")
                    or row.get("field_type") == "Skip"
                ):
                    continue
                if page_number != current_page:
                    lines.append(f'print("Starting page {page_number}...", flush=True)')
                    current_page = page_number
                block_length = (
                    int(float(row["block_length"]))
                    if row["block_length"] not in ("", "0")
                    else 0
                )
                width = (
                    float(row["block_width"])
                    if row["block_width"] not in ("", "0")
                    else 0
                )
                y, height = float(row["bottom"]), float(row.get("height", 0))
                x, width_adjusted, extra_args = adjust_form_boxes(
                    row, width, block_length
                )
                sanitized_code = re.sub(r"[^\w\-_]", "_", code)
                base_args = [
                    #                    'widget_type="text"',
                    f'name="{sanitized_code}"',
                    f"page_number={page_number}",
                    f"x={x:.2f}",
                    f"y={y:.2f}",
                    f"height={height:.2f}",
                    f"width={width_adjusted:.2f}",
                    "bg_color=(0,0,0,0)",
                    "border_color=(0,0,0,0)",
                    "border_width=0",
                ]
                args = [*base_args, *extra_args]
                lines.append(f"pdf.create_field(Fields.TextField({', '.join(args)}))")
            lines.extend(
                [
                    f'pdf.write("{output_pdf_with_fields}")',
                    'print("Created form fields PDF.", flush=True)',
                ]
            )
        with open(script_path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines))
    except Exception as e:
        logger.error(f"Failed to generate form fields script: {e}")
    return script_path
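
To show what one generated line looks like, here is a minimal sketch that builds the text of a single `create_field` call without importing PyPDFForm. `build_field_line` is a hypothetical helper, and it includes only a subset of the arguments that the real function emits.

```python
import re


def build_field_line(code: str, page_number: int, x: float, y: float,
                     height: float, width: float) -> str:
    """Render one line of the generated script as plain text."""
    # Same substitution as the source: anything other than word characters and hyphens becomes "_".
    name = re.sub(r"[^\w\-_]", "_", code)
    args = [
        f'name="{name}"',
        f"page_number={page_number}",
        f"x={x:.2f}",
        f"y={y:.2f}",
        f"height={height:.2f}",
        f"width={width:.2f}",
    ]
    return f"pdf.create_field(Fields.TextField({', '.join(args)}))"


line = build_field_line("1-2 3", 1, 98.5, 700.0, 12.0, 53.0)
print(line)
```

Generating source text this way keeps the field-creation step inspectable, since the script can be read or edited before it is run.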

run_standalone_script(script_path)

Execute a standalone script for PDF form field creation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| script_path | str | Path to the script to run. | required |

Source code in flyfield/markup_and_fields.py
def run_standalone_script(script_path: str) -> None:
    """
    Execute a standalone script for PDF form field creation.

    Args:
        script_path (str): Path to the script to run.
    """
    print(f"Running generated form field creation script: {script_path}")
    try:
        result = subprocess.run([sys.executable, "-u", script_path], text=True)
        if result.returncode != 0:
            raise RuntimeError(
                f"Generated script failed with exit code {result.returncode}"
            )
    except Exception as e:
        logger.error(f"Error running generated script: {e}")
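
The same subprocess pattern can be tried in isolation. This sketch writes a throwaway script to a temporary file and runs it with the current interpreter; the helper name `run_script` and the script contents are illustrative only.

```python
import os
import subprocess
import sys
import tempfile


def run_script(path: str) -> int:
    """Run a Python script with the current interpreter, unbuffered, and return its exit code."""
    result = subprocess.run([sys.executable, "-u", path], text=True)
    return result.returncode


# Create a small stand-in for a generated script.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write('print("hello from generated script", flush=True)\n')
    script = tmp.name

try:
    exit_code = run_script(script)
finally:
    os.remove(script)
print(exit_code)
```

Using `sys.executable` keeps the child process in the same Python environment as the caller, which matters when PyPDFForm is installed in a virtual environment.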

run_fill_pdf_fields(csv_path, output_pdf_path, template_pdf_path, generator_script_path, boxes=None)

Generates and runs a standalone Python script to fill PDF form fields using PyPDFForm, based on data from a CSV file with 'code' and 'fill' columns.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| csv_path | str | Path to the CSV input file. | required |
| output_pdf_path | str | Path where the filled PDF should be saved. | required |
| template_pdf_path | str | Path to the input (template) PDF file. | required |
| generator_script_path | str | Path where the generated fill script will be saved. | required |
| boxes | dict | Boxes metadata, keyed by page number, merged into the CSV rows by code to supply field_type. | None |

Source code in flyfield/markup_and_fields.py
def run_fill_pdf_fields(
    csv_path: str,
    output_pdf_path: str,
    template_pdf_path: str,
    generator_script_path: str,
    boxes: Optional[Dict[int, List[Dict]]] = None,
) -> None:
    """
    Generates and runs a standalone Python script to fill PDF form fields using PyPDFForm,
    based on data from a CSV file with 'code' and 'fill' columns.

    Args:
        csv_path (str): Path to the CSV input file.
        output_pdf_path (str): Path where the filled PDF should be saved.
        template_pdf_path (str): Path to the input (template) PDF file.
        generator_script_path (str): Path where the generated fill script will be saved.
        boxes (dict, optional): Boxes metadata, keyed by page number, merged into the
            CSV rows by code to supply field_type.
    """
    from .utils import format_money_space, parse_money_space

    fill_data = {}
    try:
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            rows = []
            for row in reader:
                # Clean row values

                stripped_row = {
                    k: v.strip() if isinstance(v, str) else v for k, v in row.items()
                }
                if all(v == "" or v == "0" for v in stripped_row.values()):
                    continue
                rows.append(stripped_row)
        # Flatten boxes if any and merge to rows

        if boxes:
            flat_boxes = [entry for sublist in boxes.values() for entry in sublist]
            conditional_merge_list(rows, flat_boxes, "code", ["field_type"])
        for row in rows:
            field = row.get("code")
            value = row.get("fill")
            field_type = row.get("field_type", "")
            if not field or value in ("", "0"):
                continue
            if field_type in ("Dollars", "DollarCents"):
                decimal = field_type == "DollarCents"
                try:
                    amount = parse_money_space(value, decimal=decimal)
                    value = format_money_space(amount, decimal=decimal)
                except Exception as e:
                    print(
                        f"Warning: Could not format value '{value}' for field_type '{field_type}': {e}"
                    )
            elif field_type in ("Currency", "CurrencyDecimal"):
                import re

                value = re.sub(r"\D", "", value)
            fill_data[field] = value
    except Exception as e:
        print(f"Error reading CSV {csv_path}: {e}")
        return
    fill_dict_items = ",\n ".join(f'"{k}": {repr(v)}' for k, v in fill_data.items())
    script_content = f"""\
import sys
from PyPDFForm import PdfWrapper
print("Starting to fill PDF fields...", flush=True)
try:
    filled = PdfWrapper(
        "{template_pdf_path}",
        adobe_mode=False
    ).fill(
        {{
            {fill_dict_items}
        }},
        flatten=False
    )
    filled.write("{output_pdf_path}")
    print("Filled PDF saved to {output_pdf_path}", flush=True)
except Exception as e:
    print(f"Exception during filling: {{e}}", file=sys.stderr, flush=True)
    sys.exit(1)
"""

    try:
        with open(generator_script_path, "w", encoding="utf-8") as script_file:
            script_file.write(script_content)
        print(f"Generated fill script saved to {generator_script_path}")
    except Exception as e:
        print(f"Error writing fill script to {generator_script_path}: {e}")
        return
    try:
        result = subprocess.run(
            [sys.executable, generator_script_path], capture_output=True, text=True
        )
        print("Fill script stdout:")
        print(result.stdout)
        print("Fill script stderr:")
        print(result.stderr)
        if result.returncode != 0:
            print(f"Fill script failed with exit code {result.returncode}")
        else:
            print("Fill script completed successfully.")
    except Exception as e:
        print(f"Error running fill script: {e}")
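
The step that turns CSV rows into a dictionary literal inside the generated script can be sketched on its own. The rows below are made-up sample data, and the money formatting for "Dollars" and "DollarCents" is left out because it depends on flyfield's own helpers.

```python
import re

# Hypothetical rows, shaped like the cleaned CSV rows after field types are merged in.
rows = [
    {"code": "1-2-1", "fill": "acme pty ltd", "field_type": ""},
    {"code": "1-4-1", "fill": "1,234.50", "field_type": "CurrencyDecimal"},
    {"code": "1-5-1", "fill": "0", "field_type": ""},
]

fill_data = {}
for row in rows:
    value = row["fill"]
    if not row["code"] or value in ("", "0"):
        continue  # empty and zero fills are skipped, as in the source
    if row["field_type"] in ("Currency", "CurrencyDecimal"):
        value = re.sub(r"\D", "", value)  # keep digits only
    fill_data[row["code"]] = value

# repr() quotes and escapes each value so it is a valid Python literal in the generated script.
fill_dict_items = ",\n ".join(f'"{k}": {v!r}' for k, v in fill_data.items())
print("{" + fill_dict_items + "}")
```

Because the values go through `repr()`, quotes or backslashes in a fill value should not break the generated script. The field codes are inserted between plain double quotes, so this sketch assumes they contain no quote characters.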

flyfield.utils

General utility functions.

Helper functions for parsing, formatting, and validation.

add_suffix_to_filename(filename, suffix)

Add a suffix before the file extension in a filename.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filename | str | Original filename. | required |
| suffix | str | Suffix to add. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Filename with suffix added. |

Source code in flyfield/utils.py
def add_suffix_to_filename(filename: str, suffix: str) -> str:
    """
    Add a suffix before the file extension in a filename.

    Args:
        filename (str): Original filename.
        suffix (str): Suffix to add.

    Returns:
        str: Filename with suffix added.
    """
    base, ext = os.path.splitext(filename)
    return f"{base}{suffix}{ext}"
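
A quick illustration of the behaviour, using a local copy of the function so the snippet runs without flyfield installed. The filenames are made up.

```python
import os


def add_suffix_to_filename(filename: str, suffix: str) -> str:
    """Local copy of the function above, for demonstration."""
    base, ext = os.path.splitext(filename)
    return f"{base}{suffix}{ext}"


filled = add_suffix_to_filename("form.pdf", "-filled")
nested = add_suffix_to_filename("archive.tar.gz", "_v2")
bare = add_suffix_to_filename("README", "-old")
print(filled, nested, bare)
```

Since `os.path.splitext` splits only at the last dot, a double extension such as `.tar.gz` receives the suffix before `.gz`, and a name with no extension simply has the suffix appended.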

colour_match(color, target_color=TARGET_COLOUR, tol=0.001)

Check if a color matches a target within a tolerance.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| color | tuple | RGB color tuple. | required |
| target_color | tuple | RGB target color. | TARGET_COLOUR |
| tol | float | Allowed tolerance. | 0.001 |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | bool | True if colors match within tolerance. |

Note

If the input color has an alpha channel (RGBA), the alpha component is ignored.

Source code in flyfield/utils.py
def colour_match(
    color: Tuple[float, ...],
    target_color: Tuple[float, float, float] = TARGET_COLOUR,
    tol: float = 1e-3,
) -> bool:
    """
    Check if a color matches a target within a tolerance.

    Args:
        color (tuple): RGB color tuple.
        target_color (tuple): RGB target color.
        tol (float): Allowed tolerance.

    Returns:
        bool: True if colors match within tolerance.

    Note:
        If the input color has an alpha channel (RGBA), the alpha component is ignored.
    """
    if not color or len(color) < 3:
        return False
    # Compare only RGB channels; ignore alpha if present

    return all(abs(a - b) < tol for a, b in zip(color[:3], target_color))
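
The tolerance and alpha handling can be checked with a few sample colours. The snippet uses a local copy of the function and assumes `TARGET_COLOUR` is pure white, which the overview gives as the default.

```python
from typing import Tuple

TARGET_COLOUR = (1.0, 1.0, 1.0)  # assumed default: pure white


def colour_match(color: Tuple[float, ...], target_color=TARGET_COLOUR, tol: float = 1e-3) -> bool:
    """Local copy of the function above, for demonstration."""
    if not color or len(color) < 3:
        return False
    return all(abs(a - b) < tol for a, b in zip(color[:3], target_color))


results = [
    colour_match((1.0, 1.0, 0.9995)),    # within tolerance
    colour_match((1.0, 1.0, 0.99)),      # off by 0.01
    colour_match((1.0, 1.0, 1.0, 0.5)),  # RGBA: alpha ignored
    colour_match((1.0, 1.0)),            # too few channels
]
print(results)
```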

int_to_rgb(color_int)

Convert a 24-bit integer color in 0xRRGGBB format to normalized RGB tuple of floats.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| color_int | int | Integer encoding color as 0xRRGGBB. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| tuple | Tuple[float, float, float] | Normalized (r, g, b) floats in range [0.0, 1.0]. |

Source code in flyfield/utils.py
def int_to_rgb(color_int: int) -> Tuple[float, float, float]:
    """
    Convert a 24-bit integer color in 0xRRGGBB format to normalized RGB tuple of floats.

    Args:
        color_int (int): Integer encoding color as 0xRRGGBB.

    Returns:
        tuple: Normalized (r, g, b) floats in range [0.0, 1.0].
    """
    r = ((color_int >> 16) & 0xFF) / 255
    g = ((color_int >> 8) & 0xFF) / 255
    b = (color_int & 0xFF) / 255
    return (r, g, b)
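
The bit shifts above can be checked with a couple of sample values. This sketch restates the function so it runs on its own; the variable names are illustrative only.

```python
def int_to_rgb(color_int):
    """Compact restatement of the listing above."""
    r = ((color_int >> 16) & 0xFF) / 255   # top byte: red
    g = ((color_int >> 8) & 0xFF) / 255    # middle byte: green
    b = (color_int & 0xFF) / 255           # low byte: blue
    return (r, g, b)


white = int_to_rgb(0xFFFFFF)    # every channel at 255
orange = int_to_rgb(0xFF8000)   # red 255, green 128, blue 0
print(white, orange)
```

Each channel is divided by 255, so a byte value of 128 maps to 128/255 (about 0.502), not exactly 0.5.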

clean_fill_string(line_text)

Clean a concatenated fill text string by removing single spaces and collapsing each run of two or more spaces into a single space.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| line_text | str | Raw line text. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Cleaned fill string. |

Source code in flyfield/utils.py
def clean_fill_string(line_text: str) -> str:
    """
    Clean a concatenated fill text string by removing single spaces while preserving double spaces as single spaces.

    Args:
        line_text (str): Raw line text.

    Returns:
        str: Cleaned fill string.
    """
    line_text = re.sub(r" {2,}", "<<<SPACE>>>", line_text)
    line_text = line_text.replace(" ", "")
    line_text = line_text.replace("<<<SPACE>>>", " ")
    return line_text
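
A small worked example of the three-step substitution. The input string here is made up for illustration: single spaces stand for gaps between per-character boxes, and the double space stands for a real word or group break.

```python
import re


def clean_fill_string(line_text):
    """Compact restatement of the listing above."""
    line_text = re.sub(r" {2,}", "<<<SPACE>>>", line_text)  # protect real gaps
    line_text = line_text.replace(" ", "")                  # drop box separators
    return line_text.replace("<<<SPACE>>>", " ")            # restore real gaps


amount = clean_fill_string("1 2 3  4 5")
name = clean_fill_string("J O H N   S M I T H")
print(repr(amount), repr(name))
```

Any run of two or more spaces collapses to one space, so the triple space in the second input is treated the same as a double space.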

allowed_text(text, field_type=None)

Determine if text is allowed based on predefined rules and field type.

Helps to filter out pre-filled or invalid box contents.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text extracted from a box. | required |
| field_type | str or None | Optional current field type guess to refine allowed patterns. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| tuple | Tuple[bool, Optional[str]] | (bool indicating if allowed, detected field type or None) |

Source code in flyfield/utils.py
def allowed_text(
    text: str, field_type: Optional[str] = None
) -> Tuple[bool, Optional[str]]:
    """
    Determine if text is allowed based on predefined rules and field type.

    Helps to filter out pre-filled or invalid box contents.

    Args:
        text (str): Text extracted from a box.
        field_type (str or None): Optional current field type guess to refine allowed patterns.

    Returns:
        tuple: (bool indicating if allowed, detected field type or None)
    """
    allowed_text_by_type = {
        "DollarCents": {".", ".00."},
        "Dollars": {".00", ".00.00"},
    }
    generic_allowed_text = {"S", "M", "I", "T", "H"}
    if field_type in allowed_text_by_type:
        allowed_set = allowed_text_by_type[field_type] | generic_allowed_text
        if text in allowed_set:
            return True, field_type
        else:
            return False, None
    else:
        for ftype, texts in allowed_text_by_type.items():
            if text in texts:
                return True, ftype
        if text in generic_allowed_text:
            return True, None
        return False, None
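
The branches above are easier to follow with concrete inputs. The sketch below restates the function in condensed form (same sets, same outcomes) so it runs standalone; the sample strings are illustrative.

```python
from typing import Optional, Tuple


def allowed_text(text: str, field_type: Optional[str] = None) -> Tuple[bool, Optional[str]]:
    """Condensed restatement of the listing above."""
    by_type = {
        "DollarCents": {".", ".00."},
        "Dollars": {".00", ".00.00"},
    }
    generic = {"S", "M", "I", "T", "H"}
    if field_type in by_type:
        # A known field type restricts the check to that type's patterns.
        ok = text in (by_type[field_type] | generic)
        return (True, field_type) if ok else (False, None)
    for ftype, texts in by_type.items():
        if text in texts:
            return True, ftype          # the text itself reveals the field type
    return (text in generic), None


detected = allowed_text(".00")               # no hint: type is inferred
generic_hit = allowed_text("S")              # generic placeholder letter
wrong_type = allowed_text(".", "Dollars")    # "." belongs to DollarCents only
prefilled = allowed_text("John")             # ordinary text is rejected
```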

format_money_space(amount, decimal=True)

Format a numeric amount to a string with space-separated thousands and optional decimal.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| amount | float or int | Numeric amount to format. | required |
| decimal | bool | Whether to include two decimal places. | True |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Formatted monetary string. |

Source code in flyfield/utils.py
def format_money_space(amount: float, decimal: bool = True) -> str:
    """
    Format a numeric amount to a string with space-separated thousands and optional decimal.

    Args:
        amount (float or int): Numeric amount to format.
        decimal (bool): Whether to include two decimal places.

    Returns:
        str: Formatted monetary string.
    """
    if decimal:
        s = f"{amount:,.2f}"
        int_part, dec_part = s.split(".")
        int_part = int_part.replace(",", " ")
        return f"{int_part} {dec_part}"
    else:
        s = f"{int(amount):,}"
        int_part = s.replace(",", " ")
        return int_part
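
A brief sketch of the output format, using a restated copy of the function so it runs without flyfield installed. The amounts are arbitrary sample values.

```python
def format_money_space(amount, decimal=True):
    """Compact restatement of the listing above."""
    if decimal:
        int_part, dec_part = f"{amount:,.2f}".split(".")
        # The decimal point also becomes a space: "1,234.50" -> "1 234 50".
        return f"{int_part.replace(',', ' ')} {dec_part}"
    return f"{int(amount):,}".replace(",", " ")


with_cents = format_money_space(1234567.5)
whole_only = format_money_space(1234.99, decimal=False)
small = format_money_space(0.5)
print(with_cents, "|", whole_only, "|", small)
```

With decimal=False the amount passes through int(), which truncates toward zero rather than rounding, so 1234.99 is formatted as 1 234.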

parse_money_space(money_str, decimal=True)

Parse a monetary string with optional implied decimal space formatting.

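Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| money_str | str | Monetary string to parse (e.g., "1 234 56" means 1234.56 if decimal is True). | required |
| decimal | bool | Whether the group after the last space represents cents. | True |

Returns:

| Name | Type | Description |
|---|---|---|
| float | float | Parsed monetary value. The code below returns a float when decimal is True and an int when decimal is False. |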

Source code in flyfield/utils.py
def parse_money_space(money_str: str, decimal: bool = True) -> float:
    """
    Parse a monetary string with optional implied decimal space formatting.

    Args:
        money_str (str): Monetary string to parse (e.g., "1 234 56" means 1234.56 if decimal is True).
        decimal (bool): Whether the last two digits represent cents (default True).

    Returns:
        float: Parsed monetary value as a float.
    """
    if decimal:
        if " " in money_str:
            parts = money_str.rsplit(" ", 1)
            int_part = parts[0].replace(" ", "")
            dec_part = parts[1]
            combined = f"{int_part}.{dec_part}"
            return float(combined)
        else:
            # No decimal part found, treat as int

            return float(money_str.replace(" ", ""))
    else:
        return int(money_str.replace(" ", ""))
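
The sketch below traces how the last space is interpreted, using a condensed restatement of the function. The input strings are illustrative.

```python
def parse_money_space(money_str, decimal=True):
    """Condensed restatement of the listing above."""
    if not decimal:
        return int(money_str.replace(" ", ""))
    if " " in money_str:
        # Everything after the LAST space is taken as the decimal part.
        int_part, dec_part = money_str.rsplit(" ", 1)
        return float(f"{int_part.replace(' ', '')}.{dec_part}")
    return float(money_str)


with_cents = parse_money_space("1 234 567 50")
whole_only = parse_money_space("1 234", decimal=False)
no_space = parse_money_space("50")
pitfall = parse_money_space("1 234")   # decimal=True: "234" becomes the fraction
print(with_cents, whole_only, no_space, pitfall)
```

The last call shows why the decimal flag has to match how the string was produced: a whole-number string with thousands spacing is read as 1.234 when decimal is left at True.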

parse_implied_decimal(s)

Parse a numeric string with implied decimal (last two digits as decimals).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| s | str | Numeric string (e.g., "12345" -> 123.45). | required |

Returns:

| Name | Type | Description |
|---|---|---|
| float | float | Parsed float value. |

Source code in flyfield/utils.py
def parse_implied_decimal(s: str) -> float:
    """
    Parse a numeric string with implied decimal (last two digits as decimals).

    Args:
        s (str): Numeric string (e.g., "12345" -> 123.45).

    Returns:
        float: Parsed float value.
    """
    s = s.strip()
    digits_only = re.sub(r"\D", "", s)

    if not digits_only:
        return 0.0
    if len(digits_only) <= 2:
        # If only 1 or 2 digits, treat as fractional part

        combined = f"0.{digits_only.zfill(2)}"
    else:
        combined = f"{digits_only[:-2]}.{digits_only[-2:]}"
    return float(combined)
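
A few sample inputs make the edge cases concrete. This sketch restates the function so it runs on its own; the inputs are illustrative.

```python
import re


def parse_implied_decimal(s):
    """Compact restatement of the listing above."""
    digits_only = re.sub(r"\D", "", s.strip())   # keep digits, drop all else
    if not digits_only:
        return 0.0
    if len(digits_only) <= 2:
        return float(f"0.{digits_only.zfill(2)}")
    return float(f"{digits_only[:-2]}.{digits_only[-2:]}")


typical = parse_implied_decimal("12345")
one_digit = parse_implied_decimal("7")         # padded to "07" -> 0.07
spaced = parse_implied_decimal("1 234 56")     # spaces are discarded
empty = parse_implied_decimal("")
signed = parse_implied_decimal("-500")         # the minus sign is discarded too
print(typical, one_digit, spaced, empty, signed)
```

Because every non-digit character is removed before parsing, a leading minus sign is lost; callers that need negative amounts would have to handle the sign separately.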

version()

Return the current version string of the library/module.

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Version string. |

Source code in flyfield/utils.py
def version() -> str:
    """
    Return the current version string of the library/module.

    Returns:
        str: Version string.
    """
    try:
        # Python 3.8+

        from importlib.metadata import PackageNotFoundError
        from importlib.metadata import version as pkg_version
    except ImportError:
        # For Python <3.8

        from importlib_metadata import PackageNotFoundError
        from importlib_metadata import version as pkg_version
    try:
        return pkg_version("flyfield")
    except PackageNotFoundError:
        return "unknown"
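
The same lookup-with-fallback pattern can be sketched for any distribution name. The helper name installed_version and the package name below are hypothetical and not part of flyfield; the sketch assumes Python 3.8 or later, where importlib.metadata is in the standard library.

```python
from importlib.metadata import PackageNotFoundError
from importlib.metadata import version as pkg_version


def installed_version(package_name: str) -> str:
    """Return the installed version of a distribution, or "unknown"."""
    try:
        return pkg_version(package_name)
    except PackageNotFoundError:
        # Raised when no installed distribution has this name, for example
        # when code runs from a source checkout that was never installed.
        return "unknown"


# Hypothetical name chosen so that the fallback branch is exercised.
missing = installed_version("surely-not-an-installed-package-0xflyfield")
print(missing)
```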

parse_pages(pages_str)

Parse a string specifying pages or page ranges into a list of page integers.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pages_str | str | Pages specified as a comma-separated list or ranges (e.g., "1,3-5"). | required |

Returns:

| Type | Description |
|---|---|
| List[int] | Sorted list of unique page numbers. |

Source code in flyfield/utils.py
def parse_pages(pages_str: str) -> List[int]:
    """
    Parse a string specifying pages or page ranges into a list of page integers.

    Args:
        pages_str (str): Pages specified as a comma-separated list or ranges (e.g., "1,3-5").

    Returns:
        list[int]: List of individual page numbers.
    """
    pages = set()
    for part in pages_str.split(","):
        part = part.strip()
        if "-" in part:
            start_str, end_str = part.split("-")
            start, end = int(start_str), int(end_str)
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
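
A quick sketch of the range expansion, with the function restated so the snippet is self-contained. The page strings are illustrative.

```python
from typing import List


def parse_pages(pages_str: str) -> List[int]:
    """Compact restatement of the listing above."""
    pages = set()
    for part in pages_str.split(","):
        part = part.strip()
        if "-" in part:
            start_str, end_str = part.split("-")
            pages.update(range(int(start_str), int(end_str) + 1))  # inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


simple = parse_pages("1,3-5")
messy = parse_pages("5, 1-2, 2")   # out of order, with an overlapping entry
print(simple, messy)
```

Ranges are inclusive at both ends, and the set removes duplicates before sorting. Malformed parts (an empty item, or a range with more than one hyphen) raise ValueError from int() or from the tuple unpacking.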

conditional_merge_list(main_list, ref_list, match_key, keys_to_merge)

Conditionally merge dictionaries in a main list with those in a reference list.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| main_list | list[dict] | Primary list of dictionaries. | required |
| ref_list | list[dict] | Reference list of dictionaries. | required |
| match_key | str | Key to match dictionaries. | required |
| keys_to_merge | list[str] | Keys to merge from ref_list into main_list. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| None | None | Modifies main_list in place. |

Source code in flyfield/utils.py
def conditional_merge_list(
    main_list: List[Dict],
    ref_list: List[Dict],
    match_key: str,
    keys_to_merge: List[str],
) -> None:
    """
    Conditionally merge dictionaries in a main list with those in a reference list.

    Args:
        main_list (list[dict]): Primary list of dictionaries.
        ref_list (list[dict]): Reference list of dictionaries.
        match_key (str): Key to match dictionaries.
        keys_to_merge (list[str]): Keys to merge from ref_list into main_list.

    Returns:
        None: Modifies main_list in place.
    """
    # Build lookup dictionary for efficient matching

    ref_lookup = {item[match_key]: item for item in ref_list if match_key in item}
    for record in main_list:
        ref_record = ref_lookup.get(record.get(match_key))
        if ref_record:
            for key in keys_to_merge:
                if key in ref_record:
                    record[key] = ref_record[key]
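
A minimal sketch of the in-place merge on made-up records. The keys id, name, field_type, and extra are illustrative and not necessarily the keys flyfield uses; the function is restated so the snippet runs standalone.

```python
from typing import Dict, List


def conditional_merge_list(
    main_list: List[Dict], ref_list: List[Dict], match_key: str, keys_to_merge: List[str]
) -> None:
    """Compact restatement of the listing above."""
    ref_lookup = {item[match_key]: item for item in ref_list if match_key in item}
    for record in main_list:
        ref_record = ref_lookup.get(record.get(match_key))
        if ref_record:
            for key in keys_to_merge:
                if key in ref_record:
                    record[key] = ref_record[key]


main = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3}]
ref = [
    {"id": 1, "field_type": "Dollars", "extra": "x"},  # matched; "extra" not copied
    {"id": 3, "extra": "y"},                           # matched; nothing to copy
    {"name": "orphan"},                                # no "id": never matched
]

result = conditional_merge_list(main, ref, "id", ["field_type"])
print(main)
```

Only the keys named in keys_to_merge are copied, unmatched records are left untouched, and the function returns None because main_list is modified in place.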