All functions listed in this document are safe to call from the main thread, and all callbacks run on the main thread unless explicitly noted otherwise.
ModelRunner
A ModelRunner represents a loaded model instance. The SDK returns concrete ModelRunner implementations, but your code only needs the protocol surface:
public protocol ModelRunner {
func createConversation(systemPrompt: String?) -> Conversation
func createConversationFromHistory(history: [ChatMessage]) -> Conversation
func generateResponse(
conversation: Conversation,
generationOptions: GenerationOptions?,
onResponseCallback: @escaping (MessageResponse) -> Void,
onErrorCallback: ((LeapError) -> Void)?
) -> GenerationHandler
func unload() async
var modelId: String { get }
}
Lifecycle
- Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
- Hold a strong reference to the ModelRunner for as long as you need to perform generations.
- Call unload() when you are done to release native resources (optional; this also happens automatically on deinit).
- Access modelId to identify the loaded model (for analytics, debugging, or UI labels).
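A minimal lifecycle sketch, assuming a runner was already obtained from the SDK's model-loading API (loading is not covered here):

// Assumes `runner` came from the SDK's model-loading API.
final class ChatSession {
    // Keep a strong reference for as long as generations are needed.
    private let runner: ModelRunner

    init(runner: ModelRunner) {
        self.runner = runner
        print("Loaded model:", runner.modelId) // analytics, debugging, or UI labels
    }

    func startConversation() -> Conversation {
        runner.createConversation(systemPrompt: "You are a helpful assistant.")
    }

    func shutDown() async {
        // Optional: release native resources early instead of waiting for deinit.
        await runner.unload()
    }
}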
Low-level generation API
generateResponse(...) drives generation with callbacks and returns a GenerationHandler you can store to cancel the run. Most apps call the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control (for example, integrating with custom async primitives).
let handler = runner.generateResponse(
conversation: conversation,
generationOptions: options,
onResponseCallback: { message in
// Handle MessageResponse values here
},
onErrorCallback: { error in
// Handle LeapError
}
)
// Stop generation early if needed
handler.stop()
GenerationHandler
public protocol GenerationHandler: Sendable {
func stop()
}
The handler returned by ModelRunner.generateResponse or Conversation.generateResponse(..., onResponse:) lets you cancel generation without tearing down the conversation.
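For example, a view model can keep the most recent handler so a Stop button can cancel the in-flight run while the conversation stays usable (a sketch; the surrounding types are illustrative):

final class GenerationController {
    private var activeHandler: GenerationHandler?

    func start(_ conversation: Conversation, with message: ChatMessage) {
        activeHandler = conversation.generateResponse(message: message) { response in
            // Render each MessageResponse as it arrives.
        }
    }

    func stopTapped() {
        // Cancels generation; the conversation itself stays intact.
        activeHandler?.stop()
        activeHandler = nil
    }
}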
Conversation
Conversation tracks chat state and provides streaming helpers built on top of the model runner.
public class Conversation {
public let modelRunner: ModelRunner
public private(set) var history: [ChatMessage]
public private(set) var functions: [LeapFunction]
public private(set) var isGenerating: Bool
public init(modelRunner: ModelRunner, history: [ChatMessage])
public func registerFunction(_ function: LeapFunction)
public func exportToJSON() throws -> [[String: Any]]
public func generateResponse(
userTextMessage: String,
generationOptions: GenerationOptions? = nil
) -> AsyncThrowingStream<MessageResponse, Error>
public func generateResponse(
message: ChatMessage,
generationOptions: GenerationOptions? = nil
) -> AsyncThrowingStream<MessageResponse, Error>
@discardableResult
public func generateResponse(
message: ChatMessage,
generationOptions: GenerationOptions? = nil,
onResponse: @escaping (MessageResponse) -> Void
) -> GenerationHandler?
}
Properties
history: Copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully.
functions: Functions registered via registerFunction(_:) for function calling.
isGenerating: Boolean flag indicating whether a generation is currently running. Attempts to start a new generation while this is true immediately finish with an empty stream (or nil handler for the callback variant).
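For example, checking isGenerating before starting a new turn avoids the empty-stream / nil-handler path entirely (a sketch; sendMessage is a hypothetical entry point):

func sendMessage(_ text: String, in conversation: Conversation) {
    guard !conversation.isGenerating else {
        // A generation is already running; queue or ignore the request.
        return
    }
    let message = ChatMessage(role: .user, content: [.text(text)])
    conversation.generateResponse(message: message) { response in
        // Handle MessageResponse values here.
    }
}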
Streaming Convenience
The most common pattern is to use the async-stream helpers:
let user = ChatMessage(role: .user, content: [.text("Hello! What can you do?")])
Task {
do {
for try await response in conversation.generateResponse(
message: user,
generationOptions: GenerationOptions(temperature: 0.7)
) {
switch response {
case .chunk(let delta):
print(delta, terminator: "")
case .reasoningChunk(let thought):
print("Reasoning:", thought)
case .functionCall(let calls):
handleFunctionCalls(calls)
case .audioSample(let samples, let sampleRate):
audioRenderer.enqueue(samples, sampleRate: sampleRate)
case .complete(let completion):
let text = completion.message.content.compactMap { item in
if case .text(let value) = item { return value }
return nil
}.joined()
print("\nComplete:", text)
if let stats = completion.stats {
print("Prompt tokens: \(stats.promptTokens), completions: \(stats.completionTokens)")
}
}
}
} catch {
print("Generation failed: \(error)")
}
}
Cancelling the task that iterates the stream stops generation and cleans up native resources.
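For example, keep a reference to the Task and cancel it when the user leaves the screen (a sketch using the API shown above):

let generationTask = Task {
    for try await response in conversation.generateResponse(message: user) {
        // Render streaming output.
    }
}

// Later, e.g. when the view disappears:
generationTask.cancel() // stops generation and releases native resources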
Callback Convenience
Use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:
let handler = conversation.generateResponse(message: user) { response in
updateUI(with: response)
}
// Later
handler?.stop()
If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.
The callback overload does not surface generation errors. Use the async-stream helper or call ModelRunner.generateResponse with onErrorCallback when you need error handling.
Export Chat History
exportToJSON() serializes the conversation history into a [[String: Any]] payload that mirrors OpenAI's chat-completions format. This is useful for persistence, analytics, or debugging tools.
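For example, the payload can be handed to JSONSerialization and written to disk (a sketch; the file location is arbitrary):

import Foundation

do {
    let payload = try conversation.exportToJSON()
    let data = try JSONSerialization.data(withJSONObject: payload, options: [.prettyPrinted])
    let url = FileManager.default.temporaryDirectory.appendingPathComponent("chat-history.json")
    try data.write(to: url)
} catch {
    print("Failed to export history: \(error)")
}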
MessageResponse
public enum MessageResponse {
case chunk(String)
case reasoningChunk(String)
case audioSample(samples: [Float], sampleRate: Int)
case functionCall([LeapFunctionCall])
case complete(MessageCompletion)
}
public struct MessageCompletion {
public let message: ChatMessage
public let finishReason: GenerationFinishReason
public let stats: GenerationStats?
public var info: GenerationCompleteInfo { get }
}
public struct GenerationCompleteInfo {
public let finishReason: GenerationFinishReason
public let stats: GenerationStats?
}
public struct GenerationStats {
public var promptTokens: UInt64
public var completionTokens: UInt64
public var totalTokens: UInt64
public var tokenPerSecond: Float
}
chunk: Partial assistant text emitted during streaming.
reasoningChunk: Model reasoning tokens wrapped between <think> / </think> (only for models that expose reasoning traces).
audioSample: PCM audio frames streamed from audio-capable checkpoints. Feed them into an audio renderer or buffer for later playback.
functionCall: One or more function/tool invocations requested by the model. See the Function Calling guide.
complete: Signals the end of generation. Access the assembled assistant reply through completion.message. Stats and finish reason live on the completion object; completion.info is provided for backward compatibility.
Errors surfaced during streaming are delivered through the thrown error of AsyncThrowingStream, or via the onErrorCallback closure when using the lower-level API.
GenerationOptions
Tune generation behavior with GenerationOptions.
public struct GenerationOptions {
public var temperature: Float?
public var topP: Float?
public var minP: Float?
public var repetitionPenalty: Float?
public var jsonSchemaConstraint: String?
public var functionCallParser: LeapFunctionCallParserProtocol?
public init(
temperature: Float? = nil,
topP: Float? = nil,
minP: Float? = nil,
repetitionPenalty: Float? = nil,
jsonSchemaConstraint: String? = nil,
functionCallParser: LeapFunctionCallParserProtocol? = LFMFunctionCallParser()
)
}
- Leave a field as nil to fall back to the defaults packaged with the model bundle.
functionCallParser controls how tool-call tokens are parsed. LFMFunctionCallParser (the default) handles Liquid Foundation Model Pythonic function calling. Supply HermesFunctionCallParser() for Hermes/Qwen3 formats, or set the parser to nil to receive raw tool-call text in MessageResponse.chunk.
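For example, the parser can be swapped or disabled when constructing the options (a configuration-only sketch):

// Hermes / Qwen3 tool-call format
let hermesOptions = GenerationOptions(functionCallParser: HermesFunctionCallParser())

// No parsing: tool-call text arrives verbatim via MessageResponse.chunk
let rawOptions = GenerationOptions(functionCallParser: nil)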
jsonSchemaConstraint activates constrained generation. Use setResponseFormat(type:) to populate it from a type annotated with the @Generatable macro.
extension GenerationOptions {
public mutating func setResponseFormat<T: GeneratableType>(type: T.Type) throws {
self.jsonSchemaConstraint = try JSONSchemaGenerator.getJSONSchema(for: type)
}
}
var options = GenerationOptions(temperature: 0.6, topP: 0.9)
try options.setResponseFormat(type: CityFact.self)
for try await response in conversation.generateResponse(
message: user,
generationOptions: options
) {
// Handle structured output
}
LiquidInferenceEngineRunner exposes advanced utilities such as getPromptTokensSize(messages:addBosToken:) for applications that need to budget tokens ahead of time. These methods are backend-specific and may be elevated to the ModelRunner protocol in a future release.
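A sketch of ahead-of-time token budgeting, assuming the runner can be downcast to LiquidInferenceEngineRunner and that getPromptTokensSize(messages:addBosToken:) returns a plain token count (the return type is an assumption here):

// The downcast target and the integer-like return value are assumptions for this sketch.
if let liquidRunner = runner as? LiquidInferenceEngineRunner {
    let promptTokens = liquidRunner.getPromptTokensSize(
        messages: conversation.history,
        addBosToken: true
    )
    print("Prompt will consume roughly \(promptTokens) tokens")
}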