Understanding Gmod Lua Lexer Internals: Tokens, States, and Patterns
A lexer (tokenizer) converts raw source text into a stream of tokens that a parser can consume. For Garry’s Mod (Gmod) Lua — standard Lua extended with Gmod-specific APIs and conventions — an effective lexer must handle Lua syntax, Gmod idioms, and common addon patterns. This article explains lexer internals with practical examples, design choices, and pitfalls to watch for.
Why a custom Gmod Lua lexer?
- Simplified parsing: Token streams make parsing straightforward and robust.
- Tooling: Syntax highlighting, static analysis, and refactoring tools depend on accurate tokenization.
- Gmod specifics: Files often contain embedded code blocks, localized comment patterns, or custom preprocessor-like constructs (e.g., serverside/clientside markers) that vanilla Lua lexers might not expect.
Core concepts
Tokens
A token is a classified chunk of text representing an atomic language element. Typical token types for Gmod Lua:
- Keywords: e.g., if, else, function, local, return
- Identifiers: variable and function names
- Literals: numbers, strings, boolean, nil
- Operators and punctuation: `+ - * / % ^ # == ~= <= >= < > = .. ... . , ; : :: ( ) { } [ ]` (GLua additionally accepts the C-style aliases `!`, `!=`, `&&`, `||`)
- Comments: single-line (`--`) and multi-line (`--[[ … ]]`); GLua also accepts C-style `//` and `/* */` comments
- Whitespace: often skipped but sometimes tracked for tooling
- Gmod-specific markers: e.g., `if SERVER then` or `if CLIENT then` blocks (the lexer treats them as ordinary keywords and identifiers, but tooling may note them)
- Preprocessor-like tokens: some projects use tags like `@shared`, `@server` in comments; treat these as comment tokens, optionally parsed further.
Token structure (recommended):
- type: token kind (string/enum)
- value: raw text or parsed value (e.g., number as numeric)
- line, col: start position for diagnostics
- length / end position: optional
Example token object:
```lua
{ type = "IDENT", value = "net", line = 12, col = 5 }
```
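After matching the identifier pattern, the lexer decides whether the text is a keyword via a lookup table. A minimal sketch (the table lists standard Lua 5.1 keywords; GLua additionally treats `continue` as a keyword):

```lua
-- Set of Lua keywords; a matched identifier is looked up here.
local KEYWORDS = {}
for word in ("and break do else elseif end false for function if in " ..
             "local nil not or repeat return then true until while"):gmatch("%a+") do
  KEYWORDS[word] = true
end
KEYWORDS["continue"] = true  -- GLua extension

-- Classify a matched identifier as KEYWORD or IDENT.
local function classify(name)
  return KEYWORDS[name] and "KEYWORD" or "IDENT"
end
```

Doing classification after the identifier match keeps the scanner simple: one pattern, one table lookup, no per-keyword branches.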
States
Lexers often use a finite set of states to correctly parse context-sensitive constructs:
- Default: scanning general code
- String: inside a string literal (track delimiter and escapes)
- Long bracket: Lua's `[[ … ]]` multiline string/comment state
- Comment: inside a `--` line comment or a long comment
- Number parsing: decimal, hex, with exponent handling (often handled inline)
- Preprocessor / annotation parsing: if you want to extract tags from comments
State transitions:
- From Default, upon encountering `"` or `'`, enter the String state.
- From Default, upon `--`, enter Comment (line) or Long bracket (if `--[[`) state.
- From the String state, handle escapes (`\`) and end on the matching delimiter.
- The Long bracket state must track the level of `=` signs: `[=[ … ]=]`.
Using an explicit state stack simplifies nested long brackets or interpolations if introduced.
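An explicit state stack can be modeled with small push/pop helpers; a minimal sketch (state names are illustrative, not from any particular implementation):

```lua
-- Explicit state stack: the lexer always acts on the top state.
local stack = { "Default" }

local function push(state) stack[#stack + 1] = state end

local function pop()
  assert(#stack > 1, "cannot pop the Default state")
  local top = stack[#stack]
  stack[#stack] = nil
  return top
end

local function current() return stack[#stack] end

-- Example transition: an opening quote enters the String state,
-- and the matching closing delimiter pops back to Default.
push("String")
pop()
```

Even when the grammar today never nests more than one level deep, the stack costs little and leaves room for future constructs.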
Patterns and Matching
Lexers often use regex-like patterns or manual character inspection. For Gmod Lua in Lua itself, a common approach is a mix: fast pattern searches for simple tokens and character-at-a-time for tricky constructs.
Key patterns:
- Identifier: `^[A-Za-z_][A-Za-z0-9_]*` (in Lua pattern syntax: `^[%a_][%w_]*`)
- Number: complex; support decimal, hex (0x…), fractional part, exponent (e/E)
- String: starts with `"` or `'` and allows escapes such as `\"`, `\\`, `\n`, etc.
- Long bracket: opens with `%[(=*)%[`; the close must repeat the same number of `=` signs (`]=*]`)
- Comment:
  - Line: `--` up to end of line
  - Long: `--` immediately followed by a long bracket (`--[[ … ]]`, `--[=[ … ]=]`, …)
Be cautious: Lua’s long bracket delimiter can include equals signs ([=[ … ]=]), so you must capture the exact sequence when opening and require an identical sequence to close.
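The identifier and number patterns above can be applied with anchored `string.find` calls at the current position. A simplified sketch (the numeric pattern is deliberately loose; a production lexer would validate exponents more strictly):

```lua
-- Match an identifier or a number at position `pos`. Returns the token
-- type, the matched text, and the position just past the match, or nil.
local function match_token(source, pos)
  -- Identifier: letter or underscore, then letters/digits/underscores.
  local s, e = source:find("^[%a_][%w_]*", pos)
  if s then return "IDENT", source:sub(s, e), e + 1 end
  -- Hex number: 0x followed by hex digits (check before decimal).
  s, e = source:find("^0[xX]%x+", pos)
  if s then return "NUMBER", source:sub(s, e), e + 1 end
  -- Decimal number with optional fraction and exponent (simplified).
  s, e = source:find("^%d+%.?%d*[eE]?[%+%-]?%d*", pos)
  if s then return "NUMBER", source:sub(s, e), e + 1 end
  return nil
end
```

Ordering matters: the hex branch must run before the decimal branch, or `0x1F` would tokenize as the number `0` followed by the identifier `x1F`.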
Example Lua pattern (simplified) to find long brackets:
```lua
-- find returns start, end, and the captured run of "=" signs
local start, open_end, eqs = source:find("%[(=*)%[", pos)
-- then search for the matching close: "]" .. eqs .. "]"
```
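Putting the open and close halves together, exact-level matching can be implemented with a captured `=` run and a plain (non-pattern) search for the close. A minimal sketch:

```lua
-- Scan a long bracket starting at `pos`, which must point at the opening
-- "[". Returns the contents and the position just past the close, or nil
-- if there is no long bracket / it is unterminated.
local function read_long_bracket(source, pos)
  local _, open_end, eqs = source:find("^%[(=*)%[", pos)
  if not open_end then return nil end
  -- The closing delimiter must carry the same number of "=" signs.
  local close = "]" .. eqs .. "]"
  local close_start, close_end = source:find(close, open_end + 1, true)
  if not close_start then return nil end  -- unterminated: report a diagnostic
  return source:sub(open_end + 1, close_start - 1), close_end + 1
end
```

The `true` argument to `find` requests a plain substring search, so the constructed closer is not misread as a pattern; the first matching close wins, which mirrors Lua's own semantics.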
Example lexer flow (pseudo)
- Initialize position, line, col, state = Default.
- While not end:
- If state == Default:
- Skip whitespace; update pos/line/col.
- If the next two chars are `--`, enter Comment (line or long).
- If the char is `"` or `'`, enter String and record the delimiter.
- If the char is `[`, check for a long bracket; if so, enter LongBracket.
- Match identifiers/keywords via pattern; numbers via the numeric pattern.
- Emit tokens for operators/punctuation (handle two-char operators like `==`, `<=`, `~=`).
- If state == String:
- Consume until unescaped delimiter; handle escapes; emit STRING token.
- If state == LongBracket:
- Scan until matching closing bracket level; emit LONG_STRING or LONG_COMMENT.
- If state == Comment:
- Consume to end-of-line; emit COMMENT token.
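The operator step above needs longest-match ordering: try two-character operators before falling back to a single character. A simplified sketch (the `...` varargs token and GLua's three-character sequences are omitted; the C-style aliases `!=`, `&&`, `||` are GLua extensions):

```lua
-- Two-character operators must be tried before single-character ones.
local TWO_CHAR = {
  ["=="] = true, ["~="] = true, ["<="] = true, [">="] = true,
  [".."] = true, ["!="] = true, ["&&"] = true, ["||"] = true,
}

-- Returns the operator text and the position just past it.
local function match_operator(source, pos)
  local two = source:sub(pos, pos + 1)
  if TWO_CHAR[two] then return two, pos + 2 end
  return source:sub(pos, pos), pos + 1
end
```

Without this ordering, `==` would be emitted as two `=` tokens and the parser would see an assignment where the source had a comparison.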
Handling edge cases
- Unterminated strings/long brackets: lexers should report clear diagnostics with line/col and attempt to recover (e.g., treat rest of file as string or stop at EOF).
- Nested long brackets: Lua does not nest by delimiter; treat inner brackets as content.
- Escape sequences: decide whether to unescape string values in lexer or leave raw text for parser.
- CRLF vs LF: normalize newlines consistently for line/column tracking.
- Performance: avoid repeated substring allocations. Use indices into the original source and only allocate token.value when needed.
- Large files / addons: stream processing or chunked reading reduces memory usage.
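Newline normalization and position tracking can live in one "advance" helper that walks consumed text and treats `\r\n` as a single newline. A minimal sketch:

```lua
-- Advance line/col counters over already-consumed `text`,
-- treating "\r\n" as one newline so CRLF files report the
-- same positions as LF files.
local function advance(line, col, text)
  local i = 1
  while i <= #text do
    local c = text:sub(i, i)
    if c == "\r" and text:sub(i + 1, i + 1) == "\n" then
      line, col, i = line + 1, 1, i + 2
    elseif c == "\n" or c == "\r" then
      line, col, i = line + 1, 1, i + 1
    else
      col, i = col + 1, i + 1
    end
  end
  return line, col
end
```

Centralizing the bookkeeping here means every lexer state reports consistent diagnostics, whatever the file's line endings.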
Gmod-specific considerations
- Many addons include data files or chunked code in comments; consider scanning comments for annotations (`@param`, `@server`) and emitting structured annotation tokens.
- Common patterns like `if SERVER then` / `if CLIENT then` might be used by tools to split files into client/server parts. A post-lexing pass that recognizes these conditional blocks can be practical.
- Sandboxed environments and custom preprocessors: if your tool must operate on packed/obfuscated code, add a preprocessing stage (decompression, deobfuscation) before lexing.
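Annotation extraction from comment text can be a simple post-lexing pass over COMMENT tokens. A minimal sketch (the `@tag` convention is illustrative, not a Gmod standard):

```lua
-- Extract @-style annotations (e.g. "@server", "@param ply Player")
-- from a comment's text, returning { tag = ..., text = ... } records.
local function extract_annotations(comment_text)
  local annotations = {}
  for tag, rest in comment_text:gmatch("@(%w+)([^@]*)") do
    annotations[#annotations + 1] = {
      tag  = tag,
      text = rest:match("^%s*(.-)%s*$"),  -- trim surrounding whitespace
    }
  end
  return annotations
end
```

Keeping this out of the lexer proper means the token stream stays language-level; tools that do not care about annotations simply skip the pass.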
Testing and validation
- Create fuzz tests with random inputs including all edge constructs (unterminated strings, long brackets with varying `=` counts, odd Unicode identifiers).
- Unit tests for token sequences from representative Gmod addons: weapons, gamemodes, HUDs.
- Performance benchmarks on large addons and shared repositories.
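Token-sequence unit tests reduce to comparing a lexer's output against expected (type, value) pairs; a small helper makes those assertions readable (the token shape matches the structure recommended earlier):

```lua
-- Compare an actual token stream against expected {type, value} pairs.
local function assert_tokens(tokens, expected)
  assert(#tokens == #expected, "token count mismatch")
  for i, exp in ipairs(expected) do
    assert(tokens[i].type == exp[1], "type mismatch at token " .. i)
    assert(tokens[i].value == exp[2], "value mismatch at token " .. i)
  end
  return true
end
```

Usage: `assert_tokens(lex("local x"), { {"KEYWORD", "local"}, {"IDENT", "x"} })` against whatever `lex` you are testing.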
Sample minimal Lua lexer snippet (conceptual)
```lua
-- conceptual: not production-ready
local function lex(source)
  local pos, len, line, col = 1, #source, 1, 1
  local tokens = {}

  local function emit(type, value)
    tokens[#tokens + 1] = { type = type, value = value, line = line, col = col }
  end

  while pos <= len do
    local ch = source:sub(pos, pos)
    if ch:match("%s") then
      -- Whitespace: track newlines for line/col bookkeeping.
      if ch == "\n" then line, col = line + 1, 1 else col = col + 1 end
      pos = pos + 1
    elseif source:sub(pos, pos + 1) == "--" then
      -- Line comment (long comments omitted for brevity).
      local nl = source:find("\n", pos + 2) or (len + 1)
      emit("COMMENT", source:sub(pos, nl - 1))
      col = col + (nl - pos)
      pos = nl
    elseif ch == '"' or ch == "'" then
      -- String literal: consume until the unescaped matching delimiter.
      local start = pos
      pos = pos + 1
      while pos <= len do
        local c = source:sub(pos, pos)
        if c == "\\" then pos = pos + 2
        elseif c == ch then pos = pos + 1; break
        else pos = pos + 1 end
      end
      emit("STRING", source:sub(start, pos - 1))
      col = col + (pos - start)
    else
      -- Identifiers, numbers, operators: simplified out here.
      pos = pos + 1
      col = col + 1
    end
  end
  return tokens
end
```
Conclusion
A robust Gmod Lua lexer balances correctness (handling Lua’s nuanced long brackets and escape rules), performance (minimal copying, streaming where needed), and Gmod-specific duties (annotations, server/client splitting). Implement explicit lexer states, precise long-bracket matching, and thorough tests. For tooling, consider a post-lexing pass to extract Gmod annotations and conditional blocks.