GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval

We build a practical GLM-5.2 workflow using its hosted, OpenAI-compatible API instead of running the model locally. We set up multiple providers, load the API key securely, and create a reusable chat wrapper. We then test thinking-effort control, streamed reasoning, function calling, a tool-using agent, structured JSON output, and long-context retrieval. We close with token and cost accounting so every demo stays measurable.

MarkTechPost · Sana Hassan

In this tutorial, we work with GLM-5.2 and use its hosted, OpenAI-compatible API instead of running the full model locally. We begin by setting up multiple provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. Then we move beyond a simple chatbot example and test the model in more practical situations, including reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.

Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper


import sys, subprocess

# Install or upgrade the OpenAI SDK quietly (notebook-style bootstrap).
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

import os, re, json, time, getpass
from openai import OpenAI

# Each provider exposes GLM-5.2 behind an OpenAI-compatible endpoint.
PROVIDERS = {
    "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",    "model": "glm-5.2",         "env": "ZAI_API_KEY"},
    "openrouter":  {"base_url": "https://openrouter.ai/api/v1",     "model": "z-ai/glm-5.2",    "env": "OPENROUTER_API_KEY"},
    "together":    {"base_url": "https://api.together.xyz/v1",      "model": "zai-org/GLM-5.2", "env": "TOGETHER_API_KEY"},
    "requesty":    {"base_url": "https://router.requesty.ai/v1",    "model": "zai/glm-5.2",     "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1", "model": "zai-org/GLM-5.2", "env": "HF_TOKEN"},
}

PROVIDER = "zai"
CFG = PROVIDERS[PROVIDER]
MODEL = CFG["model"]

def load_api_key(env_name):
    # Prefer Colab secrets, then the environment, then an interactive prompt.
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v: return v
    except Exception:
        pass  # not running in Colab, or the secret is not set
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")

client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

# Prices per 1M tokens, used only for the cost estimate at the end.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}

def _track(usage):
    # Accumulate token counts; tolerates usage=None or missing fields.
    _USAGE["in"] += getattr(usage, "prompt_tokens", 0) or 0
    _USAGE["out"] += getattr(usage, "completion_tokens", 0) or 0
    _USAGE["calls"] += 1

def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val: return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"): return extra["reasoning_content"]
    try: return obj.to_dict().get("reasoning_content")
    except Exception: return None

def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort: None | "high" | "max" (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking: extra["reasoning_effort"] = effort
    if tool_stream: extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    # Only send tool and streaming options when they actually apply.
    if tools: kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream: kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)

We set up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define multiple provider options, load the API key securely, create the OpenAI client, and set up token-cost tracking for the entire notebook. We also build a reusable chat wrapper so that every subsequent demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.
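To make the wrapper's branching easier to inspect, here is a minimal sketch that rebuilds only the request-assembly step as a pure function, so the payload can be examined without an API key or a network call. The helper name `build_request` is hypothetical. The field names `thinking`, `reasoning_effort`, `tool_stream`, and `stream_options` come from the wrapper above; whether a given provider honors each of them is an assumption to check against that provider's documentation. This sketch attaches `tools` and `stream_options` only when they are relevant.

```python
def build_request(model, messages, effort=None, thinking=True, tools=None,
                  tool_choice="auto", stream=False, max_tokens=2048,
                  temperature=1.0, tool_stream=False):
    """Hypothetical helper: return the kwargs a chat call would send, without sending them."""
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        # Effort is only meaningful while thinking is enabled.
        extra["reasoning_effort"] = effort
    if tool_stream:
        extra["tool_stream"] = True
    kwargs = dict(model=model, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        kwargs["stream_options"] = {"include_usage": True}
    return kwargs

msgs = [{"role": "user", "content": "ping"}]
fast = build_request("glm-5.2", msgs, thinking=False, effort="max")
deep = build_request("glm-5.2", msgs, effort="high", stream=True)
print(fast["extra_body"])   # effort is dropped because thinking is off
print(deep["extra_body"], deep["stream_options"])
```

Printing the payload this way is a cheap check before spending tokens, for example to confirm that an effort level is silently ignored when thinking is disabled.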

Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2


def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user", "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print((resp.choices[0].message.content or "").strip())

def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    # Same prompt, three settings: compare latency and output-token counts.
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high", dict(thinking=True, effort="high")),
                      ("effort=max", dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
        r = get_reasoning(msg)
        if r:  # no reasoning trace is returned when thinking is off
            print(" [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        print(" [answer]: " + " ".join((msg.content or '').split())[:350])

def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content":
                    "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        # The final chunk carries usage and may have no choices.
        if getattr(chunk, "usage", None): usage = chunk.usage
        if not chunk.choices: continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            if not saw_r: print("\n[thinking] ", end="", flush=True); saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            if not saw_a: print("\n\n[answer] ", end="", flush=True); saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)

We start testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately as the response is being generated.
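The channel separation described above can be exercised offline. The sketch below feeds hand-made stand-in chunks (built with `SimpleNamespace`, not real API responses) through a collector that mirrors the streaming loop. The helper names `split_channels` and `fake_chunk` are hypothetical, and the assumption that reasoning arrives in a `reasoning_content` field on each delta follows the tutorial's `get_reasoning` helper rather than any guarantee across providers.

```python
from types import SimpleNamespace

def split_channels(chunks):
    """Collect reasoning text, answer text, and usage from an iterable of stream chunks."""
    reasoning, answer, usage = [], [], None
    for chunk in chunks:
        if getattr(chunk, "usage", None):
            usage = chunk.usage          # usually only present on the last chunk
        if not getattr(chunk, "choices", None):
            continue                     # usage-only chunk: nothing to print
        delta = chunk.choices[0].delta
        r = getattr(delta, "reasoning_content", None)
        if r:
            reasoning.append(r)
        c = getattr(delta, "content", None)
        if c:
            answer.append(c)
    return "".join(reasoning), "".join(answer), usage

def fake_chunk(reasoning=None, content=None):
    """Stand-in for one streamed chunk; shape assumed, not captured from the API."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)

fake_stream = [
    fake_chunk(reasoning="Rayleigh scattering "),
    fake_chunk(reasoning="favors short wavelengths."),
    fake_chunk(content="Blue light scatters most. "),
    fake_chunk(content="TL;DR: scattering."),
    SimpleNamespace(choices=[], usage=SimpleNamespace(prompt_tokens=12, completion_tokens=30)),
]

thinking_text, answer_text, usage = split_channels(fake_stream)
print("[thinking]", thinking_text)
print("[answer]", answer_text)
print("tokens:", usage.prompt_tokens, usage.completion_tokens)
```

Returning the two channels as separate strings, instead of printing as they arrive, makes it straightforward to log the reasoning trace while showing users only the answer.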

Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent


def tool_calculator(expression: str):
    # Allow only digits, arithmetic operators, parentheses, dots, spaces, and %.
    if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
        return {"error": "unsupported characters"}
    try: return {"result": eval(expression, {"__builtins__": {}}, {})}
    except Exception as e: return {"error": str(e)}

_CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
             "sao paulo": 22_400_000, "mexico city": 21_800_000}

def tool_city_population(city: str):
    # Unknown cities return population=None instead of raising.
    return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}

# OpenAI-style tool schema that the model sees.
TOOLS = [
    {"type": "function", "function": {
        "name": "calculator", "description": "Evaluate basic arithmetic like '37400068/21800000'.",
        "parameters": {"type": "object", "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "city_population", "description": "Look up the metro population of a city.",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}},
]

TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}

def run_tool_loop(messages, max_rounds=6, effort="max"):
    """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
    for _ in range(max_rounds):
        resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                    max_tokens=1500, temperature=0.3)
        _track(resp.usage)
        m = resp.choices[0].message
        if not getattr(m, "tool_calls", None):
            return m.content  # no more tool requests: this is the final answer
        # Echo the assistant turn (with its tool calls) back into the history.
        messages.append({
            "role": "assistant", "content": m.content or "",
            "tool_calls": [{"id": tc.id, "type": "function",
                            "function": {"name": tc.function.name,
                                         "arguments": tc.function.arguments}}
                           for tc in m.tool_calls]})
        for tc in m.tool_calls:
            try: args = json.loads(tc.function.arguments or "{}")
            except json.JSONDecodeError: args = {}
            result = TOOL_IMPLS.get(tc.function.name, lambda **k: {"error": "unknown"})(**args)
            print(f" ↳ {tc.function.name}({args}) -> {result}")
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(result)})
    return "(stopped: max tool rounds reached)"

def demo_tools():
    print("\n=== 4. FUNCTION / TOOL CALLING ===========================")
    q = ("How many times larger is Tokyo's metro population than Mexico City's? "
         "Use the tools, then answer with the ratio to one decimal place.")
    print("Final:", " ".join((run_tool_loop([{"role": "user", "content": q}]) or "").split()))

def demo_agent():
    print("\n=== 5. MINI MULTI-STEP AGENT (tools + max effort) ========")
    task = ("Rank Tokyo, Delhi, and Shanghai by metro population (largest first), "
            "then compute the combined population of the top two and report it. "
            "Use the tools for every lookup and sum; never guess numbers.")
    ans = run_tool_loop([{"role": "system", "content": "You are a careful analyst."},
                         {"role": "user", "content": task}])
    print("Final:", " ".join((ans or "").split()))

We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
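The tutorial's calculator guards `eval` with a character whitelist. As an alternative, not the tutorial's method, the sketch below walks the parsed syntax tree with Python's `ast` module and evaluates only numeric literals and a fixed set of operators, so exponentiation, names, and function calls are rejected outright. The names `safe_calculator` and `_eval_node` are hypothetical; the return shape (`result` or `error`) matches the tool above so it could serve as a drop-in replacement.

```python
import ast
import operator

# Only these operators are evaluated; anything else raises ValueError.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
            ast.Div: operator.truediv, ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _eval_node(node):
    """Recursively evaluate a whitelisted arithmetic syntax tree."""
    if isinstance(node, ast.Expression):
        return _eval_node(node.body)
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")

def safe_calculator(expression):
    """Hypothetical drop-in for the calculator tool: same return shape, no eval."""
    try:
        tree = ast.parse(expression or "", mode="eval")
        return {"result": _eval_node(tree)}
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return {"error": str(e)}

print(safe_calculator("37400068/21800000"))   # Tokyo vs Mexico City ratio
print(safe_calculator("32900000+37400068"))   # Delhi + Tokyo
print(safe_calculator("2**8"))                # rejected: power is not whitelisted
print(safe_calculator("__import__('os')"))    # rejected: calls are not whitelisted
```

The character whitelist still admits `**`, which lets a short expression request a very large computation; an operator whitelist closes that gap at the cost of a few more lines.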

Structured JSON Output and Long-Context Retrieval with GLM-5.2


We focus on reliable, structured output and long-context retrieval. We create a JSON extraction helper, ask the model to return a strict JSON object, and retry once if the first response is not valid JSON. We also build a synthetic long document with a hidden “needle” and send it to GLM-5.2 to check whether the model retrieves the exact launch code from the provided context.
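The retry-once pattern and the needle document described above can be sketched offline as follows. This is an illustration under stated assumptions, not the tutorial's own code: the helper names `extract_json`, `ask_for_json`, and `build_haystack` are hypothetical, the model is replaced by a scripted stand-in that fails once and then succeeds, and the launch code `ZETA-4471` is a placeholder value invented for the example.

```python
import json

def extract_json(text):
    """Return the JSON object spanning the first '{' to the last '}', or None."""
    text = text or ""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

def ask_for_json(ask, prompt, retries=1):
    """ask is any callable(prompt) -> str; on failure, retry with a stricter reminder."""
    for attempt in range(retries + 1):
        suffix = "" if attempt == 0 else "\nReturn ONLY a valid JSON object, no prose."
        obj = extract_json(ask(prompt + suffix))
        if obj is not None:
            return obj
    return None

def build_haystack(needle, filler_lines=2000, position=0.5):
    """Bury one needle sentence inside repetitive filler text."""
    lines = [f"Log entry {i}: routine status nominal." for i in range(filler_lines)]
    lines.insert(int(filler_lines * position), needle)
    return "\n".join(lines)

# Scripted stand-in for the model: the first reply has no JSON, the second does.
_replies = iter(["Sure! Here is the data you asked for.",
                 'Here you go: {"city": "Tokyo", "population": 37400068}'])
result = ask_for_json(lambda p: next(_replies), "Return Tokyo's metro population as JSON.")
print(result)

doc = build_haystack("The launch code is ZETA-4471.")  # placeholder needle
print(len(doc.splitlines()), "lines; needle present:", "ZETA-4471" in doc)
```

In a live run, `ask` would wrap a chat call and return the message content, and the haystack would be sent as context together with a question asking for the exact code, so the reply can be compared to the needle by string match.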

Running All Demos with GLM-5.2 Token and Cost Accounting


def cost_summary():
    print("\n=== 8. TOKEN + COST ACCOUNTING ===========================")
    # Tokens are converted to millions before applying the per-1M prices.
    cost = _USAGE["in"]/1e6*PRICE_IN_PER_M + _USAGE["out"]/1e6*PRICE_OUT_PER_M
    print(f" calls: {_USAGE['calls']} | input: {_USAGE['in']:,} tok | output: {_USAGE['out']:,} tok")
    print(f" estimated spend @ ${PRICE_IN_PER_M}/{PRICE_OUT_PER_M} per 1M: ${cost:0.4f}")

DEMOS = [demo_basic, demo_effort, demo_streaming, demo_tools,
         demo_agent, demo_structured, demo_long_context]

print(f"Provider={PROVIDER} model={MODEL}")

for fn in DEMOS:
    # Run each demo in isolation so one failure does not stop the rest.
    try:
        fn()
    except Exception as e:
        print(f" [skipped {fn.__name__}: {type(e).__name__}: {e}]")

cost_summary()
print("\nDone. Tweak PROVIDER / effort / max_tokens and re-run any demo function.")

We finish the tutorial by collecting usage information and running all demos from top to bottom. We calculate the estimated cost from total input and output tokens, then print a compact summary of calls, token counts, and spend. We also use a driver loop so that a single failed demo does not halt the entire notebook, making the tutorial easier to run, debug, and reuse.
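As a worked instance of the cost formula, the sketch below applies the tutorial's two prices (1.40 per 1M input tokens, 4.40 per 1M output tokens) to an illustrative token count. The function name `estimate_cost` and the token figures are hypothetical, and real provider pricing may differ from these constants.

```python
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40  # per 1M tokens, as set in the tutorial

def estimate_cost(input_tokens, output_tokens,
                  price_in=PRICE_IN_PER_M, price_out=PRICE_OUT_PER_M):
    """Convert token counts to millions, then apply the per-1M prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Illustrative run: 120,000 input tokens and 30,000 output tokens.
# 0.12 * 1.40 = 0.168 and 0.03 * 4.40 = 0.132, so the total is 0.30.
cost = estimate_cost(120_000, 30_000)
print(f"${cost:0.4f}")
```

Because output tokens cost roughly three times as much as input tokens at these prices, and reasoning traces count as output, the effort setting tends to dominate spend in thinking-heavy runs.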

In conclusion, we have a practical and reusable workflow for using GLM-5.2 in Python applications. We learned how to control its reasoning behavior, compare different thinking modes, connect it with tools, validate structured outputs, test long-context inputs, and monitor token usage with estimated cost. This gives us a strong starting point for building more advanced systems such as research assistants, document analysis tools, coding agents, long-context retrieval workflows, or API-based reasoning pipelines. The resulting setup is lightweight enough for Colab but still close to how we would build with GLM-5.2 in a real project.


The post GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval appeared first on MarkTechPost.
