#+title: Plan

* TODO for 0.0.1-rc1
- [X] processmanager service
  - [X] spawn task on app startup
  - [X] loop every second
  - [X] start processes
    - [X] query waiting processes from db
    - [X] start them
    - [X] change their status to running
  - [X] stop finished processes in db & remove from RAM registry
    - [X] query status for currently running processes
    - [X] stop those that aren't status=running
    - [X] set their status to finished
- [ ] must have tweaks
  - pass options to model (ngl, path & model)
    - gpu/nogpu
    - model dropdown (ls *.gguf based)
    - size
  - markdown formatting with markdown-rs + set inner html
  - show small backend starter widget icon/button on chat page
  - test faster refresh
  - chat persistence
  - Config.toml
  - package as appimage
  - add model mode
    - amd/rocm/cuda
- [ ] ideas to investigate before release
  - stdout inspection
  - visualize setting generation ? [not really useful once settings are per chat?]

* TODO next steps after 0.0.1-rc1
- markdown formatting
- chat persistence
- backend logs inspector
- multiple chats
  - per chat settings/model etc
- configurable ngl
- custom backends via pwd, command & args
  - custom backend templates
- prompt templates
- sampling settings
- chat/completion mode?
- transfer planning into issues

* Roadmap
** 0.1
model selection from dir, switch models
- hardcoded ngl
- llamafile in path or ./llamafile only
- one chat
- simple model selection
- llamafile included templates only
** 0.2
- hardcoded inbuilt chat templates
- multiple chatrooms
  - persist settings
    - ngl setting
  - persist history
    - summaries
- extended backend settings
  - max running? running slots?
- better model selection
  - extract GGUF metadata
- model downloader ?
  - huggingface /api/models hardcoded to my account as owner
  - develop some yalu.toml manifest ?
- chat templates
  - /completions instead of /chat/completions

* Design for 0.1
- Frontend
  - settings page
    - model dir
  - chat settings drawer
    - model selection (from dir */*.gguf?)
    - chat template (from hardcoded list)
    - start/stop
- Backend
  - Settings (1)
    - model path
  - Chat (1)
    - Template
    - ModelSettings
      - model
      - ngl
  - BackendProcess (1)
    - status: started -> running -> finished
    - created from chat & saves its args
    - no update, only create & delete
  - RunnerBackend
    - keep track which processes are running
    - start/stop processes when needed

* TODO for 0.1
- Settings api
  - #[server] fn update_settings
    - model_dir
- Chat Api
  - #[server] fn update_chat
    - ChatTemplate (llama3, chatml, phi)
    - model path
    - ngl
- BackendProcess api
  - #[server] fn start_process
  - #[server] fn stop_process
  - #[server] fn restart_process ?
- BackendRunner worker
- UI stuff
  - settings page with model_dir
  - drawer on chat
    - settings (model_path & ngl)
    - start/stop
- Package for private release

* TODO Design for backend runners
** TODO
- implement backendconfig CRUD (see the server fn sketch below this list)
  - backend tab
- implement starting of a specified backendconfig
  - "running" tab ?
- add simple per-start settings
  - context & ngl
- add model per-start setting
  - needs model settings (ie. download path)
  - probably need global app settings somewhere
- better message formatting
  - markdown conversion (see the markdown sketch below)
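A rough sketch of how the backendconfig CRUD could look as Leptos-style =#[server]= fns, in the same style as the =#[server] fn= items planned above. The field names, the =BackendKind= variants and the stubbed-out persistence are assumptions for illustration, not a settled schema; imports differ slightly between Leptos versions.

#+begin_src rust
use leptos::*;
use serde::{Deserialize, Serialize};

// Assumed resource shape: a display name plus which launcher variant to use.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum BackendKind {
    ShippedLlamafile,
    LlamafileInPath,
    LlamaCppServerInPath,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct BackendConfig {
    pub id: i64,
    pub name: String,
    pub kind: BackendKind,
}

#[server]
pub async fn create_backend_config(name: String) -> Result<BackendConfig, ServerFnError> {
    // TODO: insert into sqlite (sqlx) and return the stored row;
    // hardcoded to the shipped llamafile until the config UI exists.
    Ok(BackendConfig { id: 0, name, kind: BackendKind::ShippedLlamafile })
}

#[server]
pub async fn list_backend_configs() -> Result<Vec<BackendConfig>, ServerFnError> {
    // TODO: SELECT * FROM backend_config
    Ok(vec![])
}
#+end_src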
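For the markdown conversion item: a minimal sketch using the markdown crate (markdown-rs) already mentioned in the 0.0.1-rc1 tweaks. The function name and the Leptos =inner_html= wiring in the trailing comment are illustrative, not final.

#+begin_src rust
/// Convert a model reply to HTML for the chat bubble.
/// markdown-rs refuses dangerous raw HTML by default, which is what we want
/// for untrusted model output.
fn render_message(raw: &str) -> String {
    markdown::to_html(raw)
}

// e.g. in a Leptos view: view! { <div inner_html=render_message(&msg) /> }
#+end_src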
** Newest Synthesis
- 2 Resources
  - BackendConfig
    - includes state needed to start backend
      - ie. no runtime options like -ctx -m -ngl etc
      - for no-params configs the only needed ui is a select dropdown
    - (NO PARAMS !!!!)
      - shipped llamafile
      - llamafile in PATH
      - llama.cpp server in PATH ?
    - (not mvp)
      - basic & flexible pwd, cmd, args (prefix)
        - templates for default options (can probably just be in the ui code, auto-filling the form ?)
          - llama.cpp path prebuilt
          - llama.cpp path builder
      - no explicit nix support for now!
  - BackendProcess
    - initially just start/stop with hardcoded config
- RunTimeConfig
  - model
  - context etc

** Open Questions
- how to model multiple launched instances ?
  - could have different parameters or models loaded

** Synthesis ?
- model backend as resource
  - runner can start/stop
- build interactor pattern services ?

** (Maybe) better option
runner module separate as a kind of micro subservice
- only startup fn in main, nothing pub apart from that
- server api code stays like a mostly simple crud app
- start background jobs on startup
  - starter/manager
    - reads intended backend state from sqlite
    - has internal state in struct
    - makes internal state agree with db
      - starts backends
      - stops backends
      - etc?
  - frontend just reads and writes db via server fns
  - other background job for having always up-to-date status for backends ?
    - expose status checker via backendapi interface trait

** (Maybe) stupid option
- continue current plan, start on demand via server_fn request
- how to handle only starting a single backend
  - some in-process registry needed ?

* MVP
** Backends
- start on demand
- simple start/stop
- as background service
- simple status via /health
- Options
  - llamafile
    - in $PATH
    - as executable file next to binary (enables creating a zip which "just works")
  - llama.cpp
    - via nix via path to llama.cpp directory
    - via path to binary
- Settings
  - context
  - gpu layers
  - keep model hardcoded for now

** Chat Prompt Template
- simple template defs to get from chat format (with role) to bare text prompt (see the sketch below)
- collect some default templates (chatml/llama3)
- migrate to /completions api
- apply to specific models ?
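A minimal sketch of the template-def idea: turn the role/content chat history into a bare text prompt for the /completions endpoint. The ChatML and Llama 3 strings follow the commonly published formats, but the exact special tokens (and whether the backend adds BOS/=<|begin_of_text|>= itself) should be verified against each model card.

#+begin_src rust
pub struct ChatMessage {
    pub role: String, // "system" | "user" | "assistant"
    pub content: String,
}

pub enum PromptTemplate {
    ChatMl,
    Llama3,
}

impl PromptTemplate {
    /// Render the history into one prompt string, ending with the assistant
    /// header so the model continues as the assistant. BOS handling is left
    /// to the backend.
    pub fn render(&self, messages: &[ChatMessage]) -> String {
        let mut out = String::new();
        for m in messages {
            match self {
                PromptTemplate::ChatMl => out.push_str(&format!(
                    "<|im_start|>{}\n{}<|im_end|>\n",
                    m.role, m.content
                )),
                PromptTemplate::Llama3 => out.push_str(&format!(
                    "<|start_header_id|>{}<|end_header_id|>\n\n{}<|eot_id|>",
                    m.role, m.content
                )),
            }
        }
        match self {
            PromptTemplate::ChatMl => out.push_str("<|im_start|>assistant\n"),
            PromptTemplate::Llama3 => {
                out.push_str("<|start_header_id|>assistant<|end_header_id|>\n\n")
            }
        }
        out
    }
}
#+end_src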
** Model Selection
- set folder in general settings
- read gguf metadata via gguf crate
- per-model settings (layers? ctx?, vram prediction ?)

** Inference settings (in chat as modal or sth like that)
- set sampler params in chat settings

** Settings hierarchy ?
- per_chat > per_model > per_backend > global

** Setting types ?
- Model loading
  - context
  - gpu layers
- Sampling
  - temperature
- Prompt template

* Settings planning
** Per Backend
*** runner config
- pwd
- cmd
- template for args
  - model
  - chat template
  - infer settings ? (low prio, should switch to another API that allows setting these at runtime)
** Per Model
*** offloading layers ?
** per chat
*** inference settings (runtime)

* Settings todo
- start/stop
  - start current backend on demand, just start/stop on settings page
  - disable buttons when backend isn't running
- only allow llama-cpp/llamafile launch arguments for now

* Next steps (teaser)
- [x] finish basic chat
  - [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
  - [x] edit history + system prompt
  - [x] regenerate latest response
  # - save history to db (postponed until multichat)
- [ ] backend page
  - [ ] infer sampling settings
  - [ ] running settings (gpu layer, context size etc)
- [ ] model page
  - [ ] set model dir
  - [ ] list by simple filename (& size)
  - [ ] offline metadata (README frontmatter yaml, filename, (gguf crate))
- [ ] chat settings
  - [ ] none for now, single model & settings set is selected on respective pages

* Next steps (private mvp)
- chatrooms
  - settings/model/etc per chatroom, multiple settings sets

* TODO MVP
- [ ] add test model downloader to nix devshell
- [ ] Backend config via TOML (see the config sketch at the end of this file)
  - just based on llama.cpp /completion for now
- [ ] Basic chat GUI
  - basic ui with bubbles
  - advanced ui with markdown rendering
    - fix incomplete quotes ?
- [ ] Prompt template & parameters via TOML
- [ ] Basic DB stuff
  - single room history
  - prompt templates via DB
  - parameter management via DB (e.g. temperature)
- [ ] Advanced chat UI
  - Multiple "Rooms"
  - Set prompt & params per room
- [ ] Basic RAG
  - select vector db
    - qdrant ? chroma ?

* TODO Advanced features
- [ ] Backends
  - Backend Runner
    - llamafile
    - llama.cpp nix (via cmd templates ?)
  - Backend API config?
  - Backend Downloader/Installer
- [ ] Inference Param Templates
- [ ] Prompt Templates
- [ ] model library
  - [ ] model downloader
  - [ ] model selector
    - model data extraction from gguf
  - [ ] quant selector
    - automatic offloading layer selection based on vram
  - [ ] auto-quantize
    - vocab selection
    - quant checkboxes
    - extract progress ETA
    - imatrix generation
      - dataset downloader ? (or just include a default one?)
- [ ] Better RAG
  - [ ] add multiple embedding models
  - [ ] add reranking
- [ ] Generic graph based prompt pre/postprocessing via UI, like ComfyUI
  - [ ] DSL ? Some existing scripting stuff ?
  - [ ] Graph just as visualization, with text-based config
  - [ ] Fancy Graph UI

* TODO Polish
- [ ] Backend Multi-API compat, e.g. llama.cpp /completion & /chat/completion
  - has different features (chat/completion has hardcoded prompt template)
  - support only full-featured backends for now
    - add chat support here

* TODO Go public
- Rename to YALU ?
- Polish README.md
- Clean history
- Add some more common backends (ollama ?)
- Sync to github
- Announce on /locallama
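Referenced from the TOML item in the MVP list above: a rough idea of what a backend config file plus its serde types could look like. Every key, section name and default in this sketch is made up for illustration; the real schema still has to be designed.

#+begin_src rust
use serde::Deserialize;

// Hypothetical backend.toml contents; all keys and values are examples only.
const EXAMPLE: &str = r#"
[backend]
command = "llamafile"
args = ["--server", "--nobrowser"]
port = 8080

[model]
path = "models/example.gguf"
ngl = 99
ctx = 4096
"#;

#[derive(Debug, Deserialize)]
struct Config {
    backend: Backend,
    model: Model,
}

#[derive(Debug, Deserialize)]
struct Backend {
    command: String,
    args: Vec<String>,
    port: u16,
}

#[derive(Debug, Deserialize)]
struct Model {
    path: String,
    ngl: u32,
    ctx: u32,
}

fn main() -> Result<(), toml::de::Error> {
    // Parse the embedded example; in the app this would read a file instead.
    let cfg: Config = toml::from_str(EXAMPLE)?;
    println!("{cfg:?}");
    Ok(())
}
#+end_src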