Understanding Gmod Lua Lexer Internals: Tokens, States, and Patterns
A lexer (tokenizer) converts raw source text into a stream of tokens that a parser can consume. For Garry’s Mod (Gmod) Lua — standard Lua extended with Gmod-specific APIs and conventions — an effective lexer must handle Lua syntax, Gmod idioms, and common addon patterns. This article explains lexer internals with practical examples, design choices, and pitfalls to watch for.
Why a custom Gmod Lua lexer?
- Simplified parsing: Token streams make parsing straightforward and robust.
- Tooling: Syntax highlighting, static analysis, and refactoring tools depend on accurate tokenization.
- Gmod specifics: Files often contain embedded code blocks, localized comment patterns, or custom preprocessor-like constructs (e.g., serverside/clientside markers) that vanilla Lua lexers might not expect.
Core concepts
Tokens
A token is a classified chunk of text representing an atomic language element. Typical token types for Gmod Lua:
- Keywords: e.g., if, else, function, local, return
- Identifiers: variable and function names
- Literals: numbers, strings, boolean, nil
- Operators and punctuation: `+ - * / % ^ # == ~= <= >= < > = .. ... . , ; : :: ( ) { } [ ]` (GLua additionally accepts the C-style aliases `!`, `!=`, `&&`, `||`)
- Comments: single-line (`--`) and multi-line (`--[[ … ]]`); GLua also accepts C-style `//` and `/* */` comments
- Whitespace: often skipped but sometimes tracked for tooling
- Gmod-specific markers: e.g., `if SERVER then` or `if CLIENT then` blocks (the lexer treats them as ordinary keywords and identifiers, but tooling may note them)
- Preprocessor-like tokens: some projects use tags like `@shared`, `@server` in comments; treat these as comment tokens, optionally parsed further.
Token structure (recommended):
- type: token kind (string/enum)
- value: raw text or parsed value (e.g., number as numeric)
- line, col: start position for diagnostics
- length / end position: optional
Example token object:
```lua
{ type = "IDENT", value = "net", line = 12, col = 5 }
```
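After matching the identifier pattern, the lexer decides whether the text is a keyword via a lookup table. A minimal sketch (the table lists standard Lua 5.1 keywords; GLua additionally treats `continue` as a keyword):

```lua
-- Set of Lua keywords; a matched identifier is looked up here.
local KEYWORDS = {}
for word in ("and break do else elseif end false for function if in " ..
             "local nil not or repeat return then true until while"):gmatch("%a+") do
  KEYWORDS[word] = true
end
KEYWORDS["continue"] = true  -- GLua extension

-- Classify a matched identifier as KEYWORD or IDENT.
local function classify(name)
  return KEYWORDS[name] and "KEYWORD" or "IDENT"
end
```

Doing classification after the identifier match keeps the scanner simple: one pattern, one table lookup, no per-keyword branches.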
States
Lexers often use a finite set of states to correctly parse context-sensitive constructs:
- Default: scanning general code
- String: inside a string literal (track delimiter and escapes)
- Long bracket: Lua's `[[ … ]]` multiline string/comment state
- Comment: inside a `--` line comment or a long comment
- Number parsing: decimal, hex, with exponent handling (often handled inline)
- Preprocessor / annotation parsing: if you want to extract tags from comments
State transitions:
- From Default, upon encountering `"` or `'`, enter the String state.
- From Default, upon `--`, enter Comment (line) or Long bracket (if `--[[`) state.
- From the String state, handle escapes (`\`) and end on the matching delimiter.
- The Long bracket state must track the level of `=` signs: `[=[ … ]=]`.
Using an explicit state stack simplifies nested long brackets or interpolations if introduced.
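An explicit state stack can be modeled with small push/pop helpers; a minimal sketch (state names are illustrative, not from any particular implementation):

```lua
-- Explicit state stack: the lexer always acts on the top state.
local stack = { "Default" }

local function push(state) stack[#stack + 1] = state end

local function pop()
  assert(#stack > 1, "cannot pop the Default state")
  local top = stack[#stack]
  stack[#stack] = nil
  return top
end

local function current() return stack[#stack] end

-- Example transition: an opening quote enters the String state,
-- and the matching closing delimiter pops back to Default.
push("String")
pop()
```

Even when the grammar today never nests more than one level deep, the stack costs little and leaves room for future constructs.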
Patterns and Matching
Lexers often use regex-like patterns or manual character inspection. For Gmod Lua in Lua itself, a common approach is a mix: fast pattern searches for simple tokens and character-at-a-time for tricky constructs.
Key patterns:
- Identifier: `^[A-Za-z_][A-Za-z0-9_]*` (in Lua pattern syntax: `^[%a_][%w_]*`)
- Number: complex; support decimal, hex (0x…), fractional part, exponent (e/E)
- String: starts with `"` or `'` and allows escapes such as `\"`, `\\`, `\n`, etc.
- Long bracket: opens with `%[(=*)%[`; the close must repeat the same number of `=` signs (`]=*]`)
- Comment:
  - Line: `--` up to end of line
  - Long: `--` immediately followed by a long bracket (`--[[ … ]]`, `--[=[ … ]=]`, …)
Be cautious: Lua’s long bracket delimiter can include equals signs ([=[ … ]=]), so you must capture the exact sequence when opening and require an identical sequence to close.
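The identifier and number patterns above can be applied with anchored `string.find` calls at the current position. A simplified sketch (the numeric pattern is deliberately loose; a production lexer would validate exponents more strictly):

```lua
-- Match an identifier or a number at position `pos`. Returns the token
-- type, the matched text, and the position just past the match, or nil.
local function match_token(source, pos)
  -- Identifier: letter or underscore, then letters/digits/underscores.
  local s, e = source:find("^[%a_][%w_]*", pos)
  if s then return "IDENT", source:sub(s, e), e + 1 end
  -- Hex number: 0x followed by hex digits (check before decimal).
  s, e = source:find("^0[xX]%x+", pos)
  if s then return "NUMBER", source:sub(s, e), e + 1 end
  -- Decimal number with optional fraction and exponent (simplified).
  s, e = source:find("^%d+%.?%d*[eE]?[%+%-]?%d*", pos)
  if s then return "NUMBER", source:sub(s, e), e + 1 end
  return nil
end
```

Ordering matters: the hex branch must run before the decimal branch, or `0x1F` would tokenize as the number `0` followed by the identifier `x1F`.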
Example Lua pattern (simplified) to find long brackets:
```lua
-- find returns start, end, and the captured run of "=" signs
local start, open_end, eqs = source:find("%[(=*)%[", pos)
-- then search for the matching close: "]" .. eqs .. "]"
```
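Putting the open and close halves together, exact-level matching can be implemented with a captured `=` run and a plain (non-pattern) search for the close. A minimal sketch:

```lua
-- Scan a long bracket starting at `pos`, which must point at the opening
-- "[". Returns the contents and the position just past the close, or nil
-- if there is no long bracket / it is unterminated.
local function read_long_bracket(source, pos)
  local _, open_end, eqs = source:find("^%[(=*)%[", pos)
  if not open_end then return nil end
  -- The closing delimiter must carry the same number of "=" signs.
  local close = "]" .. eqs .. "]"
  local close_start, close_end = source:find(close, open_end + 1, true)
  if not close_start then return nil end  -- unterminated: report a diagnostic
  return source:sub(open_end + 1, close_start - 1), close_end + 1
end
```

The `true` argument to `find` requests a plain substring search, so the constructed closer is not misread as a pattern; the first matching close wins, which mirrors Lua's own semantics.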
Example lexer flow (pseudo)
- Initialize position, line, col, state = Default.
- While not end:
- If state == Default:
- Skip whitespace; update pos/line/col.
- If the next two chars are `--`, enter Comment (line or long).
- If the char is `"` or `'`, enter String and record the delimiter.
- If the char is `[`, check for a long bracket; if so, enter LongBracket.
- Match identifiers/keywords via pattern; numbers via the numeric pattern.
- Emit tokens for operators/punctuation (handle two-char operators like `==`, `<=`, `~=`).
- If state == String:
- Consume until unescaped delimiter; handle escapes; emit STRING token.
- If state == LongBracket:
- Scan until matching closing bracket level; emit LONG_STRING or LONG_COMMENT.
- If state == Comment:
- Consume to end-of-line; emit COMMENT token.
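The operator step above needs longest-match ordering: try two-character operators before falling back to a single character. A simplified sketch (the `...` varargs token and GLua's three-character sequences are omitted; the C-style aliases `!=`, `&&`, `||` are GLua extensions):

```lua
-- Two-character operators must be tried before single-character ones.
local TWO_CHAR = {
  ["=="] = true, ["~="] = true, ["<="] = true, [">="] = true,
  [".."] = true, ["!="] = true, ["&&"] = true, ["||"] = true,
}

-- Returns the operator text and the position just past it.
local function match_operator(source, pos)
  local two = source:sub(pos, pos + 1)
  if TWO_CHAR[two] then return two, pos + 2 end
  return source:sub(pos, pos), pos + 1
end
```

Without this ordering, `==` would be emitted as two `=` tokens and the parser would see an assignment where the source had a comparison.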
Handling edge cases
- Unterminated strings/long brackets: lexers should report clear diagnostics with line/col and attempt to recover (e.g., treat rest of file as string or stop at EOF).
- Nested long brackets: Lua does not nest by delimiter; treat inner brackets as content.
- Escape sequences: decide whether to unescape string values in lexer or leave raw text for parser.
- CRLF vs LF: normalize newlines consistently for line/column tracking.
- Performance: avoid repeated substring allocations. Use indices into the original source and only allocate token.value when needed.
- Large files / addons: stream processing or chunked reading reduces memory usage.
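Newline normalization and position tracking can live in one "advance" helper that walks consumed text and treats `\r\n` as a single newline. A minimal sketch:

```lua
-- Advance line/col counters over already-consumed `text`,
-- treating "\r\n" as one newline so CRLF files report the
-- same positions as LF files.
local function advance(line, col, text)
  local i = 1
  while i <= #text do
    local c = text:sub(i, i)
    if c == "\r" and text:sub(i + 1, i + 1) == "\n" then
      line, col, i = line + 1, 1, i + 2
    elseif c == "\n" or c == "\r" then
      line, col, i = line + 1, 1, i + 1
    else
      col, i = col + 1, i + 1
    end
  end
  return line, col
end
```

Centralizing the bookkeeping here means every lexer state reports consistent diagnostics, whatever the file's line endings.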
Gmod-specific considerations
- Many addons include data files or chunked code in comments; consider scanning comments for annotations (`@param`, `@server`) and emitting structured annotation tokens.
- Common patterns like `if SERVER then` / `if CLIENT then` might be used by tools to split files into client/server parts. A post-lexing pass that recognizes these conditional blocks can be practical.
- Sandboxed environments and custom preprocessors: if your tool must operate on packed/obfuscated code, add a preprocessing stage (decompression, deobfuscation) before lexing.
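Annotation extraction from comment text can be a simple post-lexing pass over COMMENT tokens. A minimal sketch (the `@tag` convention is illustrative, not a Gmod standard):

```lua
-- Extract @-style annotations (e.g. "@server", "@param ply Player")
-- from a comment's text, returning { tag = ..., text = ... } records.
local function extract_annotations(comment_text)
  local annotations = {}
  for tag, rest in comment_text:gmatch("@(%w+)([^@]*)") do
    annotations[#annotations + 1] = {
      tag  = tag,
      text = rest:match("^%s*(.-)%s*$"),  -- trim surrounding whitespace
    }
  end
  return annotations
end
```

Keeping this out of the lexer proper means the token stream stays language-level; tools that do not care about annotations simply skip the pass.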
Testing and validation
- Create fuzz tests with random inputs including all edge constructs (unterminated strings, long brackets with varying `=` counts, odd Unicode identifiers).
- Unit tests for token sequences from representative Gmod addons: weapons, gamemodes, HUDs.
- Performance benchmarks on large addons and shared repositories.
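Token-sequence unit tests reduce to comparing a lexer's output against expected (type, value) pairs; a small helper makes those assertions readable (the token shape matches the structure recommended earlier):

```lua
-- Compare an actual token stream against expected {type, value} pairs.
local function assert_tokens(tokens, expected)
  assert(#tokens == #expected, "token count mismatch")
  for i, exp in ipairs(expected) do
    assert(tokens[i].type == exp[1], "type mismatch at token " .. i)
    assert(tokens[i].value == exp[2], "value mismatch at token " .. i)
  end
  return true
end
```

Usage: `assert_tokens(lex("local x"), { {"KEYWORD", "local"}, {"IDENT", "x"} })` against whatever `lex` you are testing.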
Sample minimal Lua lexer snippet (conceptual)
```lua
-- conceptual: not production-ready
local function lex(source)
  local pos, len, line, col = 1, #source, 1, 1
  local tokens = {}

  local function emit(type, value)
    tokens[#tokens + 1] = { type = type, value = value, line = line, col = col }
  end

  while pos <= len do
    local ch = source:sub(pos, pos)
    if ch:match("%s") then
      -- Whitespace: track newlines for line/col bookkeeping.
      if ch == "\n" then line, col = line + 1, 1 else col = col + 1 end
      pos = pos + 1
    elseif source:sub(pos, pos + 1) == "--" then
      -- Line comment (long comments omitted for brevity).
      local nl = source:find("\n", pos + 2) or (len + 1)
      emit("COMMENT", source:sub(pos, nl - 1))
      col = col + (nl - pos)
      pos = nl
    elseif ch == '"' or ch == "'" then
      -- String literal: consume until the unescaped matching delimiter.
      local start = pos
      pos = pos + 1
      while pos <= len do
        local c = source:sub(pos, pos)
        if c == "\\" then pos = pos + 2
        elseif c == ch then pos = pos + 1; break
        else pos = pos + 1 end
      end
      emit("STRING", source:sub(start, pos - 1))
      col = col + (pos - start)
    else
      -- Identifiers, numbers, operators: simplified out here.
      pos = pos + 1
      col = col + 1
    end
  end
  return tokens
end
```
Conclusion
A robust Gmod Lua lexer balances correctness (handling Lua’s nuanced long brackets and escape rules), performance (minimal copying, streaming where needed), and Gmod-specific duties (annotations, server/client splitting). Implement explicit lexer states, precise long-bracket matching, and thorough tests. For tooling, consider a post-lexing pass to extract Gmod annotations and conditional blocks.