Lexical Structure
This chapter defines the lexical grammar of Forge: how source text is decomposed into a sequence of tokens. The lexer (tokenizer) reads UTF-8 encoded source text and produces a flat stream of tokens that the parser consumes.
Overview
A Forge source file is a sequence of Unicode characters encoded as UTF-8. The lexer processes this character stream left-to-right, greedily matching the longest valid token at each position. The resulting token stream consists of:
- Keywords — reserved words with special meaning (e.g., `let`, `fn`, `set`, `define`)
- Identifiers — user-defined names for variables, functions, types, and fields
- Literals — integer, float, string, boolean, and null values
- Operators — arithmetic, comparison, logical, assignment, and special operators
- Punctuation — delimiters and separators (`(`, `)`, `{`, `}`, `[`, `]`, `,`, `:`, `;`)
- Comments — line comments and block comments, discarded during tokenization
- Newlines — significant for statement termination
- Decorators — `@`-prefixed annotations
Whitespace and Line Termination
Spaces (U+0020) and horizontal tabs (U+0009) are whitespace characters. They separate tokens but are otherwise insignificant and are not included in the token stream.
A newline is either a line feed (U+000A) or the sequence carriage return followed by line feed (U+000D U+000A); both serve as statement terminators. Unlike other whitespace, newlines are emitted as `Newline` tokens because Forge uses newlines (rather than semicolons) to separate statements. Semicolons (`;`) are recognized as explicit statement terminators but are not required.
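One simple way to honor the two accepted line terminators is to normalize the CRLF sequence to a bare line feed before splitting, so both forms terminate a statement identically. A sketch (not Forge's actual implementation):

```python
def split_lines(src: str) -> list[str]:
    """Split source on U+000A, treating U+000D U+000A as one terminator (sketch)."""
    # Normalize CRLF to LF so both terminator forms behave the same.
    return src.replace("\r\n", "\n").split("\n")
```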
Tokenization Order
At each position in the input, the lexer applies the following precedence:
- Skip whitespace (spaces and tabs).
- If the character begins a comment (`//` or `/*`), consume the entire comment.
- If the character is a newline, emit a `Newline` token.
- If the character is a digit, lex a numeric literal (integer or float).
- If the character is `"`, lex a string literal (or `"""` for raw strings).
- If the character is a letter or underscore, lex an identifier or keyword.
- Otherwise, lex an operator or punctuation token.
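The precedence above can be sketched as a dispatch loop. This is a simplified illustration, not Forge's actual lexer: it handles whitespace, `//` comments, newlines, integers, identifiers/keywords, and single-character punctuation, and omits strings, floats, and block comments. The keyword set shown is an assumed subset.

```python
KEYWORDS = {"let", "fn"}  # illustrative subset, not the full keyword list

def lex(src: str) -> list[tuple[str, str]]:
    """Dispatch loop following the precedence order in this section (sketch)."""
    tokens: list[tuple[str, str]] = []
    i = 0
    while i < len(src):
        c = src[i]
        if c in " \t":                        # 1. skip whitespace
            i += 1
        elif src.startswith("//", i):         # 2. consume line comment
            while i < len(src) and src[i] != "\n":
                i += 1
        elif c == "\n":                       # 3. emit a Newline token
            tokens.append(("Newline", "\n"))
            i += 1
        elif c.isdigit():                     # 4. integer literal
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("Int", src[i:j]))
            i = j
        elif c.isalpha() or c == "_":         # 6. identifier or keyword
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            word = src[i:j]
            tokens.append(("Keyword" if word in KEYWORDS else "Ident", word))
            i = j
        else:                                 # 7. operator / punctuation
            tokens.append(("Punct", c))
            i += 1
    return tokens
```

Note how the ordering matters: the comment check runs before the operator fall-through, so `//` never lexes as two division operators, and the comment is dropped while the terminating newline still produces its `Newline` token.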
Each token carries a span consisting of the line number, column number, byte offset, and byte length within the source text.
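The span described above might be represented as a small record; the field names here are assumptions for illustration, and the derivation helper assumes a single-byte (ASCII) source for simplicity:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """Source location attached to every token (illustrative layout)."""
    line: int    # 1-based line number
    column: int  # 1-based column number
    offset: int  # byte offset from the start of the source
    length: int  # byte length of the token text

def span_at(src: str, offset: int, length: int) -> Span:
    """Derive line/column from a byte offset (sketch; assumes ASCII source)."""
    before = src[:offset]
    line = before.count("\n") + 1
    column = offset - (before.rfind("\n") + 1) + 1
    return Span(line, column, offset, length)
```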
Subsections
The following subsections define each lexical element in detail:
- Source Text — encoding, file extension, line endings
- Keywords — complete keyword list with dual-syntax mappings
- Identifiers — naming rules and special identifiers
- Literals — numeric, string, boolean, null, array, and object literals
- Operators and Punctuation — all operator and delimiter tokens
- Comments — line and block comment syntax