Gmod Lua Lexer: A Beginner’s Guide to Tokenizing Garry’s Mod Scripts

Understanding Gmod Lua Lexer Internals: Tokens, States, and Patterns

A lexer (tokenizer) converts raw source text into a stream of tokens that a parser can consume. For Garry’s Mod (Gmod) Lua — standard Lua extended with Gmod-specific APIs and conventions — an effective lexer must handle Lua syntax, Gmod idioms, and common addon patterns. This article explains lexer internals with practical examples, design choices, and pitfalls to watch for.

Why a custom Gmod Lua lexer?

  • Simplified parsing: Token streams make parsing straightforward and robust.
  • Tooling: Syntax highlighting, static analysis, and refactoring tools depend on accurate tokenization.
  • Gmod specifics: Files often contain embedded code blocks, localized comment patterns, or custom preprocessor-like constructs (e.g., serverside/clientside markers) that vanilla Lua lexers might not expect.

Core concepts

Tokens

A token is a classified chunk of text representing an atomic language element. Typical token types for Gmod Lua:

  • Keywords: e.g., if, else, function, local, return
  • Identifiers: variable and function names
  • Literals: numbers, strings, boolean, nil
  • Operators and punctuation: + - * / % ^ # == ~= <= >= < > = ( ) { } [ ] ; : , . .. ... (Gmod's GLua additionally accepts the C-style operators !=, &&, ||, and !)
  • Comments: single-line (--) and multi-line (--[[ … ]]); GLua also allows C-style // and /* */ comments
  • Whitespace: often skipped but sometimes tracked for tooling
  • Gmod-specific markers: e.g., if SERVER then or if CLIENT then blocks (lexer treats them as keywords+identifiers but tooling may note them)
  • Preprocessor-like tokens: some projects use tags like @shared, @server in comments — treat as comment tokens, optionally parsed further.
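Keyword recognition is usually done by matching an identifier first and then checking it against a lookup table, which is simpler and faster than a long if-chain. A minimal sketch in Lua (the `classifyWord` helper is our name):

```lua
-- Lua keywords; identifiers are checked against this set after matching
local KEYWORDS = {}
for _, kw in ipairs({
  "and", "break", "do", "else", "elseif", "end", "false", "for",
  "function", "if", "in", "local", "nil", "not", "or", "repeat",
  "return", "then", "true", "until", "while"
}) do
  KEYWORDS[kw] = true
end

local function classifyWord(word)
  return KEYWORDS[word] and "KEYWORD" or "IDENT"
end
```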

Token structure (recommended):

  • type: token kind (string/enum)
  • value: raw text or parsed value (e.g., number as numeric)
  • line, col: start position for diagnostics
  • length / end position: optional

Example token object:

```lua
{ type = "IDENT", value = "net", line = 12, col = 5 }
```

States

Lexers often use a finite set of states to correctly parse context-sensitive constructs:

  • Default: scanning general code
  • String: inside a string literal (track delimiter and escapes)
  • Long bracket: Lua’s [[ … ]] multi-line string/comment state
  • Comment: inside a line comment or a long comment
  • Number parsing: decimal, hex, with exponent handling (often handled inline)
  • Preprocessor / annotation parsing: if you want to extract tags from comments

State transitions:

  • From Default, upon encountering " or ' → String state.
  • From Default, upon -- → Comment (line) or Long bracket (if --[[) state.
  • From String state, handle escapes (\) and end on the matching delimiter.
  • Long bracket state must handle the level of = signs: [=[ … ]=].

Using an explicit state stack simplifies nested long brackets or interpolations if introduced.

Patterns and Matching

Lexers often use regex-like patterns or manual character inspection. For Gmod Lua in Lua itself, a common approach is a mix: fast pattern searches for simple tokens and character-at-a-time for tricky constructs.

Key patterns:

  • Identifier: ^[A-Za-z_][A-Za-z0-9_]*
  • Number: complex; support decimal, hex (0x…), fractional part, exponent (e/E)
  • String: starts with " or ' and allows escapes such as \n, \t, \\, \"
  • Long bracket: opens with the Lua pattern %[(=*)%[; the closing bracket must repeat the same number of = signs
  • Comment:
    • Line: %-%-[^\n]*
    • Long: --[[ … ]] (or --[=[ … ]=] with a matching = level)
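The number rule in particular is easier to get right in stages than in one pattern, since a naive regex happily eats a trailing `e` with no exponent digits. A sketch (the `matchNumber` name is ours):

```lua
-- sketch: match a number literal starting at pos; returns the lexeme or nil
local function matchNumber(source, pos)
  -- hex first, so "0x2A" is not read as 0 followed by an identifier
  local hex = source:match("^0[xX]%x+", pos)
  if hex then return hex end
  local num = source:match("^%d+%.?%d*", pos)
  if not num then return nil end
  -- attach an exponent only if it is actually followed by digits
  local exp = source:match("^[eE][%+%-]?%d+", pos + #num)
  if exp then num = num .. exp end
  return num
end
```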

Be cautious: Lua’s long bracket delimiter can include equals signs ([=[ … ]=]), so you must capture the exact sequence when opening and require an identical sequence to close.

Example Lua pattern (simplified) to find long brackets:

```lua
local start, stop, eqs = source:find("%[(=*)%[", pos)
-- then construct the closing pattern: "%]" .. eqs .. "%]"
```
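Putting the idea together, a complete long-bracket reader might look like this (a sketch; `readLongBracket` is our name, and it returns the full lexeme plus the next scan position):

```lua
-- sketch: read a long bracket ([[ … ]], [=[ … ]=], …) starting at pos
local function readLongBracket(source, pos)
  local s, e, eqs = source:find("^%[(=*)%[", pos)
  if not s then return nil end
  -- the closing bracket must use exactly the same number of '=' signs
  local _, ce = source:find("%]" .. eqs .. "%]", e + 1)
  if not ce then return nil end -- unterminated: report a diagnostic here
  return source:sub(pos, ce), ce + 1
end
```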

Example lexer flow (pseudo)

  1. Initialize position, line, col, state = Default.
  2. While not end:
    • If state == Default:
      • Skip whitespace; update pos/line/col.
      • If next two chars == '--' then enter Comment (line or long).
      • If char == '"' or "'", enter String, record delimiter.
      • If char == '[', check for a long bracket; if so enter LongBracket.
      • Match identifiers/keywords via pattern; numbers via numeric pattern.
      • Emit token for operators/punctuation (handle multi-char operators like ==, <=, >=, ~=, .., ...).
    • If state == String:
      • Consume until unescaped delimiter; handle escapes; emit STRING token.
    • If state == LongBracket:
      • Scan until matching closing bracket level; emit LONG_STRING or LONG_COMMENT.
    • If state == Comment:
      • Consume to end-of-line; emit COMMENT token.
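For the operator handling in step 2, trying the longest candidates first avoids splitting `==` into two `=` tokens. A sketch:

```lua
-- sketch: longest-match operator scan (three-char '...' first, then two, then one)
local THREE_CHAR = { ["..."] = true }
local TWO_CHAR = {
  ["=="] = true, ["~="] = true, ["<="] = true, [">="] = true, [".."] = true,
}
local function matchOperator(source, pos)
  local three = source:sub(pos, pos + 2)
  if THREE_CHAR[three] then return three end
  local two = source:sub(pos, pos + 1)
  if TWO_CHAR[two] then return two end
  return source:sub(pos, pos)
end
```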

Handling edge cases

  • Unterminated strings/long brackets: lexers should report clear diagnostics with line/col and attempt to recover (e.g., treat rest of file as string or stop at EOF).
  • Nested long brackets: Lua does not nest by delimiter; treat inner brackets as content.
  • Escape sequences: decide whether to unescape string values in lexer or leave raw text for parser.
  • CRLF vs LF: normalize newlines consistently for line/column tracking.
  • Performance: avoid repeated substring allocations. Use indices into the original source and only allocate token.value when needed.
  • Large files / addons: stream processing or chunked reading reduces memory usage.
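For the CRLF point above, the simplest approach is to normalize line endings before lexing (a sketch):

```lua
-- normalize CRLF and lone CR to LF so line/column tracking only sees "\n"
local function normalizeNewlines(source)
  return (source:gsub("\r\n", "\n"):gsub("\r", "\n"))
end
```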

Gmod-specific considerations

  • Many addons include data files or chunked code in comments — consider scanning comments for annotations (@param, @server) and emitting structured annotation tokens.
  • Common patterns like if SERVER then/if CLIENT then might be used by tools to split files into client/server parts. A post-lexing pass that recognizes these conditional blocks can be practical.
  • Sandboxed environments and custom preprocessors: if your tool must operate on packed/obfuscated code, add a preprocessing stage (decompression, deobfuscation) before lexing.
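Scanning comment tokens for annotations can be as simple as a gmatch pass over each comment's raw text (a sketch; the tag names shown are examples):

```lua
-- sketch: collect @tag annotations from a comment token's raw text
local function extractAnnotations(comment)
  local tags = {}
  for tag in comment:gmatch("@(%a+)") do
    tags[#tags + 1] = tag
  end
  return tags
end
```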

Testing and validation

  • Create fuzz tests with random inputs including all edge constructs (unterminated strings, long brackets with varying = counts, odd Unicode identifiers).
  • Unit tests for token sequences for representative Gmod addons: weapons, gamemodes, HUDs.
  • Performance benchmarks on large addons and shared repositories.
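A small helper keeps token-sequence unit tests readable; the sketch below compares a token stream against expected type/value pairs (the `assertTokens` name is ours):

```lua
-- sketch: assert that a token stream matches expected {type, value} pairs
local function assertTokens(tokens, expected)
  assert(#tokens == #expected,
    ("expected %d tokens, got %d"):format(#expected, #tokens))
  for i, exp in ipairs(expected) do
    assert(tokens[i].type == exp[1] and tokens[i].value == exp[2],
      ("token %d: got %s %q"):format(i, tokens[i].type, tostring(tokens[i].value)))
  end
end
```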

Sample minimal Lua lexer snippet (conceptual)

```lua
-- conceptual: not production-ready
local function lex(source)
  local pos, len, line, col = 1, #source, 1, 1
  local tokens = {}
  local function emit(type, value)
    tokens[#tokens + 1] = { type = type, value = value, line = line, col = col }
  end
  while pos <= len do
    local ch = source:sub(pos, pos)
    if ch:match("%s") then
      if ch == "\n" then line = line + 1; col = 1 else col = col + 1 end
      pos = pos + 1
    elseif source:sub(pos, pos + 1) == "--" then
      -- line comment (long comments omitted here)
      local e = source:find("\n", pos + 2) or (len + 1)
      emit("COMMENT", source:sub(pos, e - 1))
      pos = e
    elseif ch == '"' or ch == "'" then
      local start = pos
      pos = pos + 1
      while pos <= len do
        local c = source:sub(pos, pos)
        if c == "\\" then pos = pos + 2
        elseif c == ch then pos = pos + 1; break
        else pos = pos + 1 end
      end
      emit("STRING", source:sub(start, pos - 1))
    else
      -- identifiers, numbers, operators simplified...
      pos = pos + 1
    end
  end
  return tokens
end
```

Conclusion

A robust Gmod Lua lexer balances correctness (handling Lua’s nuanced long brackets and escape rules), performance (minimal copying, streaming where needed), and Gmod-specific duties (annotations, server/client splitting). Implement explicit lexer states, precise long-bracket matching, and thorough tests. For tooling, consider a post-lexing pass to extract Gmod annotations and conditional blocks.
