Agentic Assistance via OpenClaw
It has been a busy month, and I recently stumbled upon OpenClaw: a self-hosted personal assistant that can browse the web, use tools, maintain memory, express personality, and even evolve over time. Wanting to try it properly, I went all-in and bought the recommended dedicated Mac Mini M4, a machine solely for the agent to control and use. I mounted it behind my iMac next to the NUC GitHub runner.
That also triggered a larger rebuild of the entire iMac VESA adapter setup. In the end, I centralized Ethernet through a small dedicated 5-port switch mounted at the bottom, so instead of four twisted-pair cables running behind the desk, there is now only a single fiber line.
Back to the actual topic: my initial goal was to use OpenClaw as support for all kinds of operations around the WTF-Model. The idea was simple: monitor and maintain the overall project, make improvements, act on tickets and requests, and maybe even write dedicated documentation. In short, offload work to the agent and free up time for other things.
Five weeks in, I am still busy configuring the setup, establishing the basics, learning AI fundamentals, and making core architectural decisions. The learning curve turned out to be steep across many different topics. One thing that was new to me is how heavily OpenClaw relies on skills: high-level natural-language instructions written the way humans naturally speak. That is very different from low-level config settings and effectively forms a separate instruction layer.
I dug into that and started defining what I actually wanted the agent to know and use. Writing these is very close to prompt engineering and the usual methods of optimizing text for effective LLM ingestion. I ended up formalizing the agent’s capabilities into real operating layers:
- mission: cross-role execution and syncs
- ops: infrastructure and administrative tasks
- qdrant-rag: grounded retrieval before decisions
That separation mattered quickly. If you do not split concerns into cleanly bounded packets and instead cram everything into the system prompt, as I initially did, problems show up fast. The prompt becomes bloated, ambiguity and contradictions creep in, and hallucinations follow. So a large part of the initial work became iterating on instructions, optimizing them for the system prompt window, and then moving them into dedicated skills that are only loaded when actually needed.
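My mental model of on-demand skills looks roughly like the sketch below. The loading mechanics here are my own illustration, not OpenClaw's actual implementation: each skill declares trigger keywords, and its instruction text is only appended to the prompt when the current task matches, which keeps the system prompt lean.

```typescript
// Sketch: a skill is only pulled into the prompt when a task matches its
// triggers, instead of living permanently in the system prompt.
interface Skill {
  name: string;
  triggers: string[];     // keywords that activate the skill
  instructions: string;   // natural-language instruction text
}

function activeSkills(task: string, skills: Skill[]): Skill[] {
  const t = task.toLowerCase();
  return skills.filter((s) => s.triggers.some((k) => t.includes(k)));
}

// Illustrative registry matching the three layers above:
const skills: Skill[] = [
  { name: "ops", triggers: ["deploy", "backup"], instructions: "Handle infrastructure tasks." },
  { name: "qdrant-rag", triggers: ["search", "recall"], instructions: "Retrieve evidence first." },
];
```

Only the matching skill's instructions get injected, so the prompt grows with the task instead of with the total number of skills.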
I also tried to self-host an LLM on the M4: first through Ollama, then through MLX, Apple's machine-learning framework optimized for Apple silicon. OpenClaw works very well with the OpenAI Completions and Chat Completions API specifications, so integrating self-hosted LLMs through Ollama and MLX was straightforward.
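Because Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions, wiring a local model in is mostly a matter of pointing the base URL at localhost. A minimal sketch (the model name is just an example):

```typescript
// Build an OpenAI-style Chat Completions request against a local Ollama
// instance; any OpenAI-compatible backend can be swapped in via baseUrl.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

function buildChatRequest(baseUrl: string, model: string, messages: ChatMessage[]) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream: false }),
    },
  };
}

// Ollama listens on port 11434 by default:
const req = buildChatRequest("http://localhost:11434", "qwen2.5:14b", [
  { role: "user", content: "Summarize the last sync." },
]);
// const res = await fetch(req.url, req.init);  // uncomment against a live host
```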
For roughly a week, I benchmarked extensively across many 14B to 27B class models, including Mistral, GPT-OSS, Qwen, and GLM.
I also experimented with so-called draft models and even coded my own MLX wrapper for that. In speculative decoding, a draft model is a much smaller model from the same family (sharing the tokenizer) that runs ahead of the main model and proposes candidate tokens, which the main model then verifies cheaply in a batched pass. That turned out to be genuinely effective: with a suitable draft model, I was able to get roughly 40% more tokens per second. But even when pushing into the 20–30 tokens-per-second range, other factors such as Time to First Token (TTFT) still failed to meet expectations.
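The core loop of (greedy) speculative decoding can be sketched in a few lines. Both "models" below are deterministic stand-ins, and in a real system the verification of all k proposed tokens happens in a single batched forward pass rather than one call per position:

```typescript
// Toy greedy speculative decoding: the draft proposes k tokens, the main
// model keeps the longest agreeing prefix plus its own correction.
type Model = (context: string[]) => string;

function speculativeStep(main: Model, draft: Model, context: string[], k: number): string[] {
  // 1. Draft model proposes k tokens autoregressively (cheap).
  const proposal: string[] = [];
  for (let i = 0; i < k; i++) {
    proposal.push(draft([...context, ...proposal]));
  }
  // 2. Main model checks each proposed token; stop at the first disagreement
  //    and emit the main model's own token as the correction.
  const accepted: string[] = [];
  for (const tok of proposal) {
    const verified = main([...context, ...accepted]);
    if (verified !== tok) {
      accepted.push(verified);
      return accepted;
    }
    accepted.push(tok);
  }
  return accepted; // all k accepted: several tokens for one verification pass
}
```

When the draft agrees often, you get multiple tokens per main-model pass, which is where the throughput gain comes from.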
After a lot of tinkering and research, my conclusion was this: you need roughly 80 GB+ of VRAM before self-hosting an LLM becomes truly viable for professional workloads, especially if you need a 100k+ context window and a capable 70B parameter model that delivers the necessary quality. OpenClaw itself starts to become somewhat viable for self-hosting around the 30B parameter mark, which I could run, but only with a very small context window.
So the short version is: the hardware I bought, an M4 with 32 GB, is simply not sufficient to run a main LLM while meeting the required non-functional requirements of speed, quality, and reliability.
I know about context compaction and tricks like quantized KV cache, but especially for coding, where you paste entire documents across multiple turns, a 32k or 64k context window is not really viable. OpenClaw also injects a very extensive system prompt, and even if you curate that aggressively, you still end up with around 4,000–5,000 tokens of overhead just for that. So the context window gets consumed quickly, and the last thing you want is to run out of room in the middle of a difficult interaction.
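The arithmetic behind that is simple. The ~5,000-token system-prompt overhead is the figure from above; the per-turn size and output reserve below are illustrative assumptions for a coding session:

```typescript
// Back-of-envelope context budgeting: how many heavy turns fit into a window?
function turnsUntilFull(
  window: number,       // total context window in tokens
  systemPrompt: number, // fixed overhead injected every request
  perTurn: number,      // pasted document + reply per turn (assumed)
  reserveOutput: number // room kept free for the model's own answer
): number {
  const usable = window - systemPrompt - reserveOutput;
  return Math.floor(usable / perTurn);
}

// 32k window, 5k system prompt, ~6k tokens per coding turn, 2k output reserve:
const turns = turnsUntilFull(32768, 5000, 6000, 2048);
```

Only a handful of heavy turns fit before compaction has to kick in, which is exactly the failure mode you do not want mid-interaction.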
After realizing that, I decided to limit self-hosting to supporting AI services such as text-to-speech, speech-to-text, and a proper embedding model for RAG, while outsourcing the actual heavy reasoning to the cloud. I first tried OpenRouter.ai, which was decent and something I would still recommend to anyone wanting to experiment with different models.
Later, I moved directly to the OpenAI API, where I have already spent my first 50 EUR. This hybrid setup is currently stable: local Ollama for embeddings, STT, and TTS, and cloud models for heavy reasoning and coding work.
I also tried Gemini through Google Cloud Console, only to discover that it lies extremely well and with high confidence, even on details.
Because of its prominence in Google Search results, I had actually been a big fan of Gemini. That changed once it sent me down a completely hallucinated path: I discovered, somewhat in shock, that the assumptions I had been building on for two days were entirely wrong and made up by Gemini. When I saw the same pattern a second time, I stopped using it altogether. No trust.
More generally, I noticed that you really have to instruct models aggressively to verify claims through actual web searches. Left to themselves, they often prefer to fabricate something plausible rather than admit they do not know. I therefore decided to force my agents to use the Brave Web Search API at the start of every task so their information stays up to date and hallucinations remain, hopefully, low.
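The "search first" step boils down to one authenticated GET request. The endpoint and the X-Subscription-Token header follow Brave's public Search API; the key and query below are placeholders:

```typescript
// Build a Brave Web Search request the agent fires before acting on a task.
function braveSearchRequest(apiKey: string, query: string, count = 5) {
  const url = new URL("https://api.search.brave.com/res/v1/web/search");
  url.searchParams.set("q", query);
  url.searchParams.set("count", String(count));
  return {
    url: url.toString(),
    init: {
      headers: {
        Accept: "application/json",
        "X-Subscription-Token": apiKey, // Brave API key
      },
    },
  };
}

const req = braveSearchRequest("BSA-PLACEHOLDER", "qdrant payload filtering syntax");
// const res = await fetch(req.url, req.init);  // live call needs a real key
```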
With OpenClaw itself, I did a lot of experimentation around configuration, extension, and maintainability. Aside from getting into zsh on macOS for the first time and learning the built-in LaunchAgents system, I also built a custom TypeScript plugin for semantic routing. The idea was to hook into an OpenClaw event, decide whether a different model would be more appropriate for the request, and then switch to it dynamically. It sounded good in theory, but in practice it did not work out: the overhead was high, the solution was brittle, and I eventually deprecated it in favor of explicit model defaults and fallbacks per role in the configuration.
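The replacement is boring on purpose: a primary model and an ordered fallback chain per role. The config shape and model names below are my own illustration, not OpenClaw's schema:

```typescript
// Explicit per-role model defaults with ordered fallbacks; no runtime
// "semantic" decision, just the first available model in the chain.
interface RoleModels { primary: string; fallbacks: string[]; }

function resolveModel(
  role: string,
  config: Record<string, RoleModels>,
  unavailable: Set<string> // models currently down or rate-limited
): string {
  const entry = config[role] ?? config["default"];
  for (const m of [entry.primary, ...entry.fallbacks]) {
    if (!unavailable.has(m)) return m;
  }
  throw new Error(`no model available for role ${role}`);
}

const config: Record<string, RoleModels> = {
  default: { primary: "gpt-4o-mini", fallbacks: ["qwen2.5:14b"] },
  coder:   { primary: "gpt-4o",      fallbacks: ["gpt-4o-mini", "qwen2.5:14b"] },
};
```

Being deterministic, this is trivial to debug, which the event-hook router never was.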
Despite all these trials and errors, there has also been real progress.
First, I attached a dedicated external 256 GB workspace SSD to the agent. On it, I set up the canonical knowledge base with strict folders for meetings, decisions, plans, research, operations, and artifacts, plus a shared Qdrant vector store for cross-agent RAG. This runs alongside each agent’s individual built-in SQLite vector memory.
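Cross-agent retrieval against that shared store is a plain REST call. The request shape follows Qdrant's points/search endpoint; the collection name and the (truncated) embedding vector are illustrative:

```typescript
// "Qdrant-first" retrieval sketch: embed the question, then query the
// shared collection before the agent commits to a decision.
function qdrantSearchRequest(base: string, collection: string, vector: number[], limit = 3) {
  return {
    url: `${base}/collections/${collection}/points/search`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        vector,            // query embedding from the local embedding model
        limit,             // top-k results
        with_payload: true // return the stored document payloads, not just ids
      }),
    },
  };
}

// Qdrant's default REST port is 6333:
const req = qdrantSearchRequest("http://localhost:6333", "shared-knowledge", [0.12, -0.03, 0.88]);
```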
I also successfully gave the agent its own email address. For that, I coded a direct TypeScript plugin integration with IMAP/SMTP tooling for listing, searching, reading, and controlled sending.
I did the same for SMS capability, where I bought a small 5 EUR/month plan and a dedicated Waveshare LTE stick. That now runs as a SIM7600 UART plugin with explicit modem status reporting for internet fallback, plus list, read, send, and delete actions. I also wired up voice-call flows using the local STT/TTS models.
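The send path over the SIM7600's UART uses the standard AT command set: AT+CMGF=1 selects text mode, AT+CMGS addresses the recipient, and Ctrl-Z (0x1A) terminates the message body. The writer below is a stand-in for a real serial port:

```typescript
// Minimal SMS send sequence over a SIM7600 modem in text mode.
type UartWrite = (data: string) => void;

function sendSms(write: UartWrite, number: string, text: string): void {
  write("AT+CMGF=1\r");            // switch modem to SMS text mode
  write(`AT+CMGS="${number}"\r`);  // recipient; modem answers with a "> " prompt
  write(text + "\x1a");            // message body, terminated by Ctrl-Z
}

// Capture what would go over the wire instead of opening a serial port:
const sent: string[] = [];
sendSms((d) => sent.push(d), "+49123456789", "Build green, deploying.");
```

A real plugin additionally has to wait for the "> " prompt and the final "+CMGS:" / "OK" responses between writes; that handshaking is omitted here.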
And of course, I got the agent registered in my GitHub organization, Azure tenant, and Steamworks account. Even though I also added Discord and Telegram capability, it was not nearly as easy to get there as it might sound.
Very often, I had to assist the agent with special CAPTCHAs and repeatedly confirm that it was allowed to perform simple operations. Eventually, I ended up editing openclaw.json directly and learning its structure and mechanics in exhaustive detail. Still, once established, the integrations themselves worked surprisingly well.
I also ran into a more serious class of problems. Initially, I instructed the agent to perform some operations on itself, only to discover later that it had effectively destroyed itself while trying to fix its own internals. I then had to manually roll it back to a previous state and restart everything myself. That happened more than once, even with explicit warnings such as "make sure you do not actually brick yourself".
So in my experience, OpenClaw cannot yet work fully autonomously on its own core configuration. I ended up removing write permissions on critical files and aligning operations with least-privilege guardrails. The agent now has access to everything it needs, but only with limited rights. Even though the Mac Mini is a dedicated machine for the agent, I still placed it behind a separate user account and granted only specific sudoers exceptions for reboot, Homebrew, and a few maintenance commands.
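For illustration, the sudoers exceptions look along these lines; the "agent" user name and exact paths are assumptions, not my actual config:

```
# /etc/sudoers.d/agent: narrow, password-less exceptions only
agent ALL=(root) NOPASSWD: /sbin/shutdown -r now
agent ALL=(root) NOPASSWD: /opt/homebrew/bin/brew update, /opt/homebrew/bin/brew upgrade
```

Everything not listed still requires a password the agent does not have, so a confused self-maintenance run can no longer take the whole machine down.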
Once I had one agent working reliably, I started extending the idea into a team of seven agents:
Writer, Researcher, Coder, DevOps Engineer, Marketing Expert, Orchestrator, and my initial personal assistant, which also evolved into the Supervisor for the other six.
That only became workable after introducing mission discipline: a fixed daily cadence with all-hands calls and duo syncs, explicit assignment ownership through the orchestrator, standby rules for non-assigned roles, and Qdrant-first evidence retrieval before decisions. At that point, it stopped feeling like one chatbot with many tabs and started feeling more like an actual operating system for coordinated agent work.
In the process of setting all of this up, I also dug much deeper into AI concepts. During that exploration, I created a mapping for the WTF / Fractal that I found genuinely useful, especially around the low-level technical relationship between auto-regressive and diffusion-based approaches to text generation.
This is essentially how an LLM "thinks." The vast majority of models today are autoregressive, meaning they generate tokens strictly sequentially, one after another, each conditioned on everything generated so far. Diffusion-based approaches also exist, where tokens are generated in parallel and refined iteratively over several passes. So I created this entry for the WTF in one afternoon and added it directly to the model. The distinction felt very fitting for the global and local domains, so I mapped it that way and would like to share a small recording of it here:
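The contrast between the two generation styles is really a difference in control flow. In this toy sketch the "models" are trivial functions, but the loops capture it: autoregressive generation appends one token at a time, left to right, while a diffusion-style pass rewrites all positions in parallel and iterates:

```typescript
// Autoregressive: each token depends on everything generated before it.
function autoregressive(next: (ctx: string[]) => string, steps: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < steps; i++) out.push(next(out)); // strictly sequential
  return out;
}

// Diffusion-style: refine the whole sequence at once, several times over.
function diffusionStyle(refine: (seq: string[]) => string[], seq: string[], iters: number): string[] {
  for (let i = 0; i < iters; i++) seq = refine(seq);   // all positions per pass
  return seq;
}
```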
To wrap things up for now, I am still not sure how viable the OpenClaw agentic approach will be in practice for long-term support work. I will most likely keep iterating slowly and refine the use of agents step by step. Honestly, this turned out to be much harder than expected, and so far it has mainly incurred costs in both money and time rather than contributing directly to the project.
Still, I will keep you updated, and in future posts I want to shift the focus back more toward the core WTF / Fractal project.