Lexical Structure
This chapter defines the lexical grammar of Forge: how source text is decomposed into a sequence of tokens. The lexer (tokenizer) reads UTF-8 encoded source text and produces a flat stream of tokens that the parser consumes.
Overview
A Forge source file is a sequence of Unicode characters encoded as UTF-8. The lexer processes this character stream left-to-right, greedily matching the longest valid token at each position. The resulting token stream consists of:
- Keywords — reserved words with special meaning (e.g., `let`, `fn`, `set`, `define`)
- Identifiers — user-defined names for variables, functions, types, and fields
- Literals — integer, float, string, boolean, and null values
- Operators — arithmetic, comparison, logical, assignment, and special operators
- Punctuation — delimiters and separators (`(`, `)`, `{`, `}`, `[`, `]`, `,`, `:`, `;`)
- Comments — line comments and block comments, discarded during tokenization
- Newlines — significant for statement termination
- Decorators — `@`-prefixed annotations
Whitespace and Line Termination
Spaces (U+0020) and horizontal tabs (U+0009) are whitespace characters. They separate tokens but are otherwise insignificant and are not included in the token stream.
A newline is either a line feed (U+000A) or the sequence carriage return followed by line feed (U+000D U+000A); both serve as statement terminators. Unlike other whitespace, newlines are emitted as `Newline` tokens because Forge uses newlines (rather than semicolons) to separate statements. Semicolons (`;`) are recognized as explicit statement terminators but are not required.
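One simple way to honor the two accepted line terminators is to normalize the CRLF sequence to a bare line feed before splitting, so both forms terminate a statement identically. A sketch (not Forge's actual implementation):

```python
def split_lines(src: str) -> list[str]:
    """Split source on U+000A, treating U+000D U+000A as one terminator (sketch)."""
    # Normalize CRLF to LF so both terminator forms behave the same.
    return src.replace("\r\n", "\n").split("\n")
```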
Tokenization Order
At each position in the input, the lexer applies the following precedence:
- Skip whitespace (spaces and tabs).
- If the character begins a comment (`//` or `/*`), consume the entire comment.
- If the character is a newline, emit a `Newline` token.
- If the character is a digit, lex a numeric literal (integer or float).
- If the character is `"`, lex a string literal (or `"""` for raw strings).
- If the character is a letter or underscore, lex an identifier or keyword.
- Otherwise, lex an operator or punctuation token.
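The precedence above can be sketched as a dispatch loop. This is a simplified illustration, not Forge's actual lexer: it handles whitespace, `//` comments, newlines, integers, identifiers/keywords, and single-character punctuation, and omits strings, floats, and block comments. The keyword set shown is an assumed subset.

```python
KEYWORDS = {"let", "fn"}  # illustrative subset, not the full keyword list

def lex(src: str) -> list[tuple[str, str]]:
    """Dispatch loop following the precedence order in this section (sketch)."""
    tokens: list[tuple[str, str]] = []
    i = 0
    while i < len(src):
        c = src[i]
        if c in " \t":                        # 1. skip whitespace
            i += 1
        elif src.startswith("//", i):         # 2. consume line comment
            while i < len(src) and src[i] != "\n":
                i += 1
        elif c == "\n":                       # 3. emit a Newline token
            tokens.append(("Newline", "\n"))
            i += 1
        elif c.isdigit():                     # 4. integer literal
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("Int", src[i:j]))
            i = j
        elif c.isalpha() or c == "_":         # 6. identifier or keyword
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            word = src[i:j]
            tokens.append(("Keyword" if word in KEYWORDS else "Ident", word))
            i = j
        else:                                 # 7. operator / punctuation
            tokens.append(("Punct", c))
            i += 1
    return tokens
```

Note how the ordering matters: the comment check runs before the operator fall-through, so `//` never lexes as two division operators, and the comment is dropped while the terminating newline still produces its `Newline` token.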
Each token carries a span consisting of the line number, column number, byte offset, and byte length within the source text.
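The span described above might be represented as a small record; the field names here are assumptions for illustration, and the derivation helper assumes a single-byte (ASCII) source for simplicity:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """Source location attached to every token (illustrative layout)."""
    line: int    # 1-based line number
    column: int  # 1-based column number
    offset: int  # byte offset from the start of the source
    length: int  # byte length of the token text

def span_at(src: str, offset: int, length: int) -> Span:
    """Derive line/column from a byte offset (sketch; assumes ASCII source)."""
    before = src[:offset]
    line = before.count("\n") + 1
    column = offset - (before.rfind("\n") + 1) + 1
    return Span(line, column, offset, length)
```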
Subsections
The following subsections define each lexical element in detail:
- Source Text — encoding, file extension, line endings
- Keywords — complete keyword list with dual-syntax mappings
- Identifiers — naming rules and special identifiers
- Literals — numeric, string, boolean, null, array, and object literals
- Operators and Punctuation — all operator and delimiter tokens
- Comments — line and block comment syntax