#+title: Plan
* TODO for 0.0.1-rc1
- [X] processmanager service (see the loop sketch at the end of this section)
- [X] spawn task on app startup
- [X] loop every second
- [X] start processes
- [X] query waiting processes from db
- [X] start them
- [X] change their status to running
- [X] stop finished processes in db & remove from RAM registry
- [X] query status for currently running processes
- [X] stop those that aren't status=running
- [X] set their status to finished
- [ ] must have tweaks
- pass options to model (ngl, path & model)
- gpu/nogpu
- model dropdown (ls *.gguf based)
- size
- markdown formatting with markdown-rs + set inner html
- show small backend starter widget icon /button on chat page
- test faster refresh
- chat persistence
- Config.toml
- package as appimage
- add model mode
- amd/rocm/cuda
- [ ] ideas to investigate before release
- stdout inspection
- visualize setting generation ? [not really useful once settings are per chat?]
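
A minimal sketch of the processmanager loop ticked through at the top of this section, assuming a tokio runtime; =Status=, =BackendProcess= and =Db= are illustrative stand-ins for the real sqlite-backed types, so names and the exact reconciliation rules may differ. In the app this would be spawned once with =tokio::spawn= at startup.
#+begin_src rust
use std::collections::HashMap;
use std::time::Duration;
use tokio::process::{Child, Command};

/// Status values assumed from the plan: waiting -> running -> finished.
#[derive(Clone, Copy, PartialEq)]
enum Status { Waiting, Running, Finished }

/// Hypothetical row type; the real rows live in sqlite.
#[derive(Clone)]
struct BackendProcess { id: i64, cmd: String, args: Vec<String>, status: Status }

/// Stand-in for the db layer.
struct Db { rows: Vec<BackendProcess> }

impl Db {
    fn with_status(&self, s: Status) -> Vec<BackendProcess> {
        self.rows.iter().filter(|r| r.status == s).cloned().collect()
    }
    fn set_status(&mut self, id: i64, s: Status) {
        if let Some(r) = self.rows.iter_mut().find(|r| r.id == id) { r.status = s; }
    }
}

/// Loops every second, reconciling the db with the in-RAM registry.
async fn process_manager(mut db: Db) {
    let mut registry: HashMap<i64, Child> = HashMap::new();
    let mut tick = tokio::time::interval(Duration::from_secs(1));
    loop {
        tick.tick().await;
        // start processes that are waiting in the db
        for row in db.with_status(Status::Waiting) {
            if let Ok(child) = Command::new(&row.cmd).args(&row.args).spawn() {
                registry.insert(row.id, child);
                db.set_status(row.id, Status::Running);
            }
        }
        // stop processes that are no longer wanted or have exited,
        // then mark them finished and drop them from the registry
        for id in registry.keys().copied().collect::<Vec<_>>() {
            let still_wanted = db.with_status(Status::Running).iter().any(|r| r.id == id);
            let exited = registry
                .get_mut(&id)
                .map(|c| matches!(c.try_wait(), Ok(Some(_))))
                .unwrap_or(true);
            if !still_wanted || exited {
                if let Some(mut child) = registry.remove(&id) {
                    let _ = child.kill().await;
                }
                db.set_status(id, Status::Finished);
            }
        }
    }
}
#+end_src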
* TODO next steps after 0.0.1-rc1
- markdown formatting
- chat persistence
- backend logs inspector
- multiple chats
- per chat settings/model etc
- configurable ngl
- custom backends via pwd, command & args
- custom backend templates
- prompt templates
- sampling settings
- chat/completion mode?
- transfer planning into issues
* Roadmap
0.1 model selection from dir, switch models
- hardcoded ngl
- llamafile in path or ./llamafile only
- one chat
- simple model selection
- llamafile included templates only
0.2
- hardcoded inbuilt chat templates
- multiple chatrooms
- persist settings
- ngl setting
- persist history
- summaries
- extended backend settings
- max running? running slots?
- better model selection
- extract GGUF metadata
- model downloader ?
- huggingface /api/models hardcoded to my account as owner
- develop some yalu.toml manifest?
- chat templates /completions instead of /chat/completions
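
For the model-downloader idea a few lines up, a minimal sketch of listing models from the Hugging Face hub, assuming its public =/api/models= endpoint accepts an =author= filter and that reqwest is built with the "json" feature; the deserialized field subset is an assumption.
#+begin_src rust
use serde::Deserialize;

/// Assumed subset of the per-model JSON returned by /api/models.
#[derive(Deserialize, Debug)]
struct HfModel {
    /// repo id like "owner/model-name"
    id: String,
}

/// List models for one owner (the hardcoded account would go here).
async fn list_models(owner: &str) -> Result<Vec<HfModel>, reqwest::Error> {
    let url = format!("https://huggingface.co/api/models?author={owner}");
    reqwest::get(url).await?.json::<Vec<HfModel>>().await
}
#+end_src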
* Design for 0.1
- Frontend
- settings page
- model dir
- chat settings drawer
- model selection (from dir */*.gguf?)
- chat template (from hardcoded list)
- start/stop
- Backend
- Settings (1)
- model path
- Chat (1)
- Template
- ModelSettings
- model
- ngl
- BackendProcess (1)
- status: started -> running -> finished
- created from chat & saves its args
- no update, only create&delete
- RunnerBackend
- keep track which processes are running
- start/stop processes when needed
* TODO for 0.1
- Settings api
- #[server] fn update_settings
- model_dir
- Chat Api
- #[server] fn update_chat
- ChatTemplate (llama3, chatml, phi)
- model path
- ngl
- BackendProcess api
- #[server] fn start_process (see sketch below)
- #[server] fn stop_process
- #[server] fn restart_process ?
- BackendRunner worker
- UI stuff
- settings page with model_dir
- drawer on chat
- settings (model_path & ngl)
- start/stop
- Package for private release
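
A minimal sketch of the =start_process= server fn planned above, assuming Leptos 0.5/0.6-style server functions; =db_set_status= is a hypothetical stub for the sqlite write that the BackendRunner worker later picks up.
#+begin_src rust
use leptos::*;

#[server(StartProcess, "/api")]
pub async fn start_process(backend_id: i64) -> Result<(), ServerFnError> {
    // Mark the process as waiting; the BackendRunner worker spawns it
    // on its next tick and flips the status to running.
    db_set_status(backend_id, "waiting")
        .await
        .map_err(|e| ServerFnError::ServerError(e.to_string()))
}

// Hypothetical stub; the real implementation writes to sqlite.
async fn db_set_status(id: i64, status: &str) -> Result<(), std::io::Error> {
    println!("backend {id} -> {status}");
    Ok(())
}
#+end_src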
* TODO Design for backend runners
** TODO
- implement backendconfig CRUD
- backend tab
- implement starting of a specified backendconfig
- "running" tab ?
- add simple per-start settings
- context & ngl
- add model per-start setting
- needs model settings (ie. download path)
- probably need global app settings somewhere
- better message formatting
- markdown conversion
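
A minimal sketch of the markdown conversion (markdown-rs plus setting inner html, as in the 0.0.1-rc1 tweak list), assuming a Leptos component; since the HTML is injected verbatim, untrusted input would still need sanitizing.
#+begin_src rust
use leptos::*;

/// Render one chat message as markdown instead of plain text.
#[component]
pub fn MessageBubble(content: String) -> impl IntoView {
    // markdown::to_html comes from the markdown-rs crate
    let html = markdown::to_html(&content);
    view! { <div class="message-bubble" inner_html=html/> }
}
#+end_src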
** Newest Synthesis
- 2 Resources
- BackendConfig
- includes state needed to start backend
- ie. no runtime options like -ctx -m -ngl etc
- for no-params configs the only UI needed is a select dropdown
- (NO PARAMS !!!!)
- shipped llamafile
- llamafile PATH
- llama.cpp server in PATH ?
- (not mvp)
- basic & flexible pwd, cmd, args(prefix)
- templates for default options (can probably just be in the ui code, auto-filling the form ?)
- llama.cpp path prebuilt
- llama.cpp path builder
- no explicit nix support for now!
- BackendProcess
- initially just start/stop with hardcoded config
- RunTimeConfig
- model
- context etc
** Open Questions
- how to model multiple launched instances ?
- could have different parameters or models loaded
** Synthesis ?
- model backend as resource
- runner can start stop
- build interactor pattern services ?
** (Maybe) better option: runner module separate as a kind of micro subservice
- only startup fn in main, nothing pub apart from that
- server api code stays like a mostly simple crud app
- start background jobs on startup
- starter/manager
- reads intended backend state from sqlite
- has internal state in struct
- makes internal state agree with db
- starts backends
- stops backends
- etc?
- frontend just reads and writes db via server fns
- other background job for having always up-to-date status for backends ?
- expose status checker via backendapi interface trait
** (Maybe) stupid option
- continue current plan, start on demand via server_fn request
- how to handle only starting a single backend
- some in process registry needed ?
* MVP
** Backends
- start on demand
- simple start/stop
- as background service
- simple status via /health (see sketch at the end of this subsection)
- Options
- llamafile
- in $PATH
- as executable file next to binary (enables creating a zip which "just works")
- llama.cpp
- via nix, with a path to the llama.cpp directory
- via path to binary
- Settings
- context
- gpu layers
- keep model hardcoded for now
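
A minimal sketch of the simple status check via /health mentioned above, assuming the backend exposes a llama.cpp-server-style health endpoint and using reqwest; the base URL is a placeholder.
#+begin_src rust
/// Poll the backend's /health endpoint; any non-2xx response or
/// connection error counts as "not running".
async fn backend_is_healthy(base_url: &str) -> bool {
    // base_url is a placeholder, e.g. "http://127.0.0.1:8080"
    match reqwest::get(format!("{base_url}/health")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}
#+end_src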
** Chat Prompt Template
- simple template defs to get from chat format (with role) to bare text prompt
- collect some default templates (chatml/llama3)
- migrate to /completions api
- apply to specific models ?
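
A minimal sketch of such a template def, going from role-based messages to a bare text prompt for the /completions api, with ChatML as the example layout; a llama3 template would follow the same shape with different delimiters.
#+begin_src rust
/// One chat message in role/content form.
struct Message { role: String, content: String }

/// Render a bare ChatML prompt from the chat history.
fn render_chatml(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        // ChatML wraps every turn in <|im_start|>role ... <|im_end|>
        prompt.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    // leave the assistant turn open so the model completes it
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
#+end_src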
** Model Selection
- set folder in general settings
- read gguf metadata via gguf crate
- per-model settings (layers? ctx?, vram prediction ?)
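
A minimal sketch of the folder scan behind model selection, using only std::fs and returning filename-level info (path and size in bytes); GGUF metadata extraction via the gguf crate is left out here.
#+begin_src rust
use std::fs;
use std::path::{Path, PathBuf};

/// List *.gguf files in the configured model dir.
fn list_gguf_models(model_dir: &Path) -> std::io::Result<Vec<(PathBuf, u64)>> {
    let mut models = Vec::new();
    for entry in fs::read_dir(model_dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.extension().and_then(|e| e.to_str()) == Some("gguf") {
            models.push((path, entry.metadata()?.len()));
        }
    }
    Ok(models)
}
#+end_src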
** Inference settings (in chat as modal or sth like that)
- set sampler params in chat settings
** Settings hierarchy ?
- per_chat>per_model>per_backend>global
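
A minimal sketch of that override order, assuming each layer exposes optional values and the most specific layer wins; the field names are illustrative.
#+begin_src rust
/// One settings layer; None means "not set at this level".
#[derive(Default, Clone, Copy)]
struct Layer {
    ngl: Option<u32>,
    ctx: Option<u32>,
    temperature: Option<f32>,
}

/// Walk the layers from most to least specific and take the first hit.
fn resolve<T: Copy>(layers: &[Layer], pick: impl Fn(&Layer) -> Option<T>, default: T) -> T {
    layers.iter().find_map(|l| pick(l)).unwrap_or(default)
}

// usage: resolve(&[per_chat, per_model, per_backend, global], |l| l.ngl, 0)
#+end_src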
** Setting types ?
- Model loading
- context
- gpu layers
- Sampling
- temperature
- Prompt template
* Settings planning
** Per Backend
*** runner config
- pwd
- cmd
- template for args
- model
- chat template
- infer settings? (low prio, should switch to other API that allows setting these at runtime)
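
A minimal sketch of launching a backend from this per-backend runner config (pwd, cmd, args template); the ={model}= placeholder substitution is an assumption about how the args template could work.
#+begin_src rust
use std::process::{Child, Command};

/// Per-backend runner config as planned above.
struct RunnerConfig {
    pwd: String,
    cmd: String,
    /// args template; "{model}" gets replaced at start time
    args: Vec<String>,
}

fn start_backend(cfg: &RunnerConfig, model_path: &str) -> std::io::Result<Child> {
    let args: Vec<String> = cfg
        .args
        .iter()
        .map(|a| a.replace("{model}", model_path))
        .collect();
    Command::new(&cfg.cmd).current_dir(&cfg.pwd).args(&args).spawn()
}
#+end_src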
** Per Model
*** offloading layers ?
** per chat
*** inference settings( runtime )
* Settings todo
- start/stop
- start current backend on demand, just start stop on settings page
- disable buttons when backend isn't running
- only allow llama-cpp/llamafile launch arguments for now
* Next steps (teaser)
- [x] finish basic chat
- [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
- [x] edit history + system prompt
- [x] regenerate latest response
# - save history to db (postponed until multichat)
- [ ] backend page
- [ ] infer sampling settings
- [ ] running settings (gpu layer, context size etc)
- [ ] model page
- [ ] set model dir
- [ ] list by simple filename (& size)
- [ ] offline metadata (README frontmatter yaml, filename, (gguf crate))
- [ ] chat settings
- none for now, single model & settings set is selected on respective pages
* Next steps (private mvp)
- chatrooms
- settings/model/etc per chatroom, multiple settings sets
* TODO MVP
- [ ] add test model downloader to nix devshell
- [ ] Backend config via TOML
- just based on llama.cpp /completion for now
- [ ] Basic chat GUI
- basic ui with bubbles
- advanced ui with markdown rendering
- fix incomplete quotes ?
- [ ] Prompt template & parameters via TOML (see sketch after this list)
- [ ] Basic DB stuff
- single room history
- prompt templates via DB
- parameter management via DB (e.g. temperature)
- [ ] Advanced chat UI
- Multiple "Rooms"
- Set prompt & params per room
- [ ] Basic RAG
- select vector db
- qdrant ? chroma ?
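
A minimal sketch of the prompt template & parameters TOML flagged above, assuming serde plus the toml crate; the keys are illustrative.
#+begin_src rust
use serde::Deserialize;

/// Illustrative shape for a prompt-template + sampling-params file.
#[derive(Deserialize, Debug)]
struct PromptConfig {
    /// e.g. "chatml" or "llama3"
    template: String,
    temperature: f32,
    top_p: Option<f32>,
}

fn parse_prompt_config(text: &str) -> Result<PromptConfig, toml::de::Error> {
    toml::from_str(text)
}

// example file contents:
// template = "chatml"
// temperature = 0.7
// top_p = 0.9
#+end_src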
* TODO Advanced features
- [ ] Backends
- Backend Runner
- llamafile
- llama.cpp nix (via cmd templates ?)
- Backend API config?
- Backend Downloader/Installer
- [ ] Inference Param Templates
- [ ] Prompt Templates
- [ ] model library
- [ ] model downloader
- [ ] model selector
- model data extraction from gguf
- [ ] quant selector
- automatic offloading layer selection based on vram
- [ ] auto-quantize
- vocab selection
- quant checkboxes
- extract progress ETA
- imatrix generation
- dataset downloader ? (or just include a default one?)
- [ ] Better RAG
- [ ] add multiple embedding models
- [ ] add reranking
- [ ] Generic graph based prompt pre/postprocessing via UI, like ComfyUI
- [ ] DSL ? Some existing scripting stuff ?
- [ ] Graph just as visualization, with text-based config
- [ ] Fancy Graph UI
* TODO Polish
- [ ] Backend Multi-API compat e.g. llama.cpp /completion & /chat/completion
- has different features (chat/completion has hardcoded prompt template)
- support only full featured backends for now
- add chat support here
* TODO Go public
- Rename to YALU ?
- Polish README.md
- Clean history
- Add some more common backends (ollama ?)
- Sync to github
- Announce on /locallama