#+title: Plan
* TODO for 0.0.1-rc1
- [X] processmanager service (see the loop sketch after this list)
- [X] spawn task on app startup
- [X] loop every second
- [X] start processes
- [X] query waiting processes from db
- [X] start them
- [X] change their status to running
- [X] stop finished processes in db & remove from RAM registry
- [X] query status for currently running processes
- [X] stop those that aren't status=running
- [X] set their status to finished
- [ ] must have tweaks
- pass options to model (ngl, path & model)
- gpu/nogpu
- model dropdown (ls *.gguf based)
- size
- markdown formatting with markdown-rs + set inner html
- show small backend starter widget icon/button on chat page
- test faster refresh
- chat persistence
- Config.toml
- package as appimage
- add model mode
- amd/rocm/cuda
- [ ] ideas to investigate before release
- stdout inspection
- visualize setting generation ? [not really useful once settings are per chat?]
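A minimal sketch of the processmanager loop described above, assuming tokio for the background task and sqlx for the SQLite access; table and column names are illustrative, not the actual schema.
#+begin_src rust
// Hypothetical sketch: spawned on app startup, reconciles DB state with running
// processes once per second.
use std::collections::HashMap;
use std::process::Child;
use std::time::Duration;
use sqlx::SqlitePool;

pub fn spawn_process_manager(pool: SqlitePool) {
    tokio::spawn(async move {
        // In-RAM registry of processes we started, keyed by their DB id.
        let mut registry: HashMap<i64, Child> = HashMap::new();
        loop {
            // 1. Query waiting processes from the DB, start them, mark them running.
            let waiting: Vec<(i64, String)> =
                sqlx::query_as("SELECT id, cmd FROM processes WHERE status = 'waiting'")
                    .fetch_all(&pool)
                    .await
                    .unwrap_or_default();
            for (id, cmd) in waiting {
                if let Ok(child) = std::process::Command::new(cmd).spawn() {
                    registry.insert(id, child);
                    let _ = sqlx::query("UPDATE processes SET status = 'running' WHERE id = ?")
                        .bind(id)
                        .execute(&pool)
                        .await;
                }
            }
            // 2. Reap exited processes: set status to finished, drop from RAM registry.
            let mut exited = Vec::new();
            for (id, child) in registry.iter_mut() {
                if matches!(child.try_wait(), Ok(Some(_))) {
                    exited.push(*id);
                }
            }
            for id in exited {
                registry.remove(&id);
                let _ = sqlx::query("UPDATE processes SET status = 'finished' WHERE id = ?")
                    .bind(id)
                    .execute(&pool)
                    .await;
            }
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });
}
#+end_src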
* TODO next steps after 0.0.1-rc1
- markdown formatting
- chat persistence
- backend logs inspector
- multiple chats
- per chat settings/model etc
- configurable ngl
- custom backends via pwd, command & args
- custom backend templates
- prompt templates
- sampling settings
- chat/completion mode?
- transfer planning into issues
* Roadmap
0.1: model selection from dir, switch models
- hardcoded ngl
- llamafile in path or ./llamafile only
- one chat
- simple model selection
- llamafile included templates only
0.2
- hardcoded inbuilt chat templates
- multiple chatrooms
- persist settings
- ngl setting
- persist history
- summaries
- extended backend settings
- max running? running slots?
- better model selection
- extract GGUF metadata
- model downloader ?
- huggingface /api/models hardcoded to my account as owner
- develop some yalu.toml manifest?
- chat templates via /completions instead of /chat/completions
* Design for 0.1
- Frontend
- settings page
- model dir
- chat settings drawer
- model selection (from dir */*.gguf?)
- chat template (from hardcoded list)
- start/stop
- Backend
- Settings (1)
- model path
- Chat (1)
- Template
- ModelSettings
- model
- ngl
- BackendProcess (1)
- status: started -> running -> finished
- created from chat & saves its args
- no update, only create&delete
- RunnerBackend
- keep track which processes are running
- start/stop processes when needed
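A rough Rust sketch of the 0.1 data model described above; the names and field types are assumptions, not the final schema.
#+begin_src rust
// Hypothetical shapes for the 0.1 design.
struct Settings {
    model_dir: String, // single global settings row
}

enum ChatTemplate {
    Llama3,
    ChatML,
    Phi,
}

struct ModelSettings {
    model: String, // path to the .gguf file
    ngl: u32,      // GPU layers to offload
}

struct Chat {
    template: ChatTemplate,
    model_settings: ModelSettings,
}

// Created from a Chat and snapshots its args; no update, only create & delete.
struct BackendProcess {
    args: Vec<String>,
    status: ProcessStatus,
}

enum ProcessStatus {
    Started,
    Running,
    Finished,
}
#+end_src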
* TODO for 0.1
- Settings api (see the server fn sketch after this list)
- #[server] fn update_settings
- model_dir
- Chat Api
- #[server] fn update_chat
- ChatTemplate (llama3, chatml, phi)
- model path
- ngl
- BackendProcess api
- #[server] fn start_process
- #[server] fn stop_process
- #[server] fn restart_process ?
- BackendRunner worker
- UI stuff
- settings page with model_dir
- drawer on chat
- settings (model_path & ngl)
- start/stop
- Package for private release
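A rough sketch of one of the #[server] functions listed above, assuming Leptos server functions with an sqlx pool provided via context; the route, table layout, and error handling are placeholders.
#+begin_src rust
use leptos::*;

// Hypothetical settings endpoint: persists the model_dir set on the settings page.
#[server(UpdateSettings, "/api")]
pub async fn update_settings(model_dir: String) -> Result<(), ServerFnError> {
    let pool = use_context::<sqlx::SqlitePool>()
        .ok_or_else(|| ServerFnError::ServerError("db pool missing from context".into()))?;
    sqlx::query("UPDATE settings SET model_dir = ? WHERE id = 1")
        .bind(model_dir)
        .execute(&pool)
        .await
        .map_err(|e| ServerFnError::ServerError(e.to_string()))?;
    Ok(())
}
#+end_src
start_process / stop_process would follow the same pattern, writing the intended status to the DB so the BackendRunner worker can pick it up.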
* TODO Design for backend runners
** TODO
- implement backendconfig CRUD
- backend tab
- implement starting of a specified backendconfig
- "running" tab ?
- add simple per-start settings
- context & ngl
- add model per-start setting
- needs model settings (ie. download path)
- probably need global app settings somewhere
- better message formatting
- markdown conversion
** Newest Synthesis
- 2 Resources
- BackendConfig
- includes state needed to start backend
- i.e. no runtime options like -ctx -m -ngl etc
- for no-params configs the only UI needed is a select dropdown
- (NO PARAMS !!!!)
- shipped llamafile
- llamafile PATH
- llama.cpp server in PATH ?
- (not mvp)
- basic & flexible pwd, cmd, args(prefix)
- templates for default options (can probably just be in the ui code, auto-filling the form ?)
- llama.cpp path prebuilt
- llama.cpp path builder
- no explicit nix support for now!
- BackendProcess
- initially just start/stop with hardcoded config
- RunTimeConfig
- model
- context etc
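One possible Rust shape for the two resources described above; the variants mirror the options listed, and all names are assumptions.
#+begin_src rust
// Hypothetical BackendConfig: only the state needed to start a backend,
// no runtime options like -m / -ctx / -ngl (those live in RunTimeConfig).
enum BackendConfig {
    ShippedLlamafile,     // llamafile bundled with the app
    LlamafileInPath,      // llamafile found via $PATH
    LlamaCppServerInPath, // llama.cpp server binary found via $PATH
    // not MVP: basic & flexible custom backend
    Custom {
        pwd: std::path::PathBuf,
        cmd: String,
        args_prefix: Vec<String>,
    },
}

// Hypothetical per-start options, combined with a BackendConfig at launch time.
struct RunTimeConfig {
    model: std::path::PathBuf,
    context: u32,
    ngl: u32,
}
#+end_src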
** Open Questions
- how to model multiple launched instances ?
- could have different parameters or models loaded
** Synthesis ?
- model backend as resource
- runner can start stop
- build interactor pattern services ?
** (Maybe) better option: runner module separate as a kind of micro subservice
- only startup fn in main, nothing pub apart from that
- server api code stays like a mostly simple crud app
- start background jobs on startup
- starter/manager
- reads intended backend state from sqlite
- has internal state in struct
- makes internal state agree with db
- starts backends
- stops backends
- etc?
- frontend just reads and writes db via server fns
- other background job for having always up-to-date status for backends ?
- expose status checker via backendapi interface trait
** (Maybe) stupid option
- continue current plan, start on demand via server_fn request
- how to handle only starting a single backend
- some in process registry needed ?
* MVP
** Backends
- start on demand
- simple start/stop
- as background service
- simple status via /health
- Options
- llamafile
- in $PATH
- as an executable file next to the binary (enables creating a zip which "just works")
- llama.cpp
- via nix via path to llama.cpp directory
- via path to binary
- Settings
- context
- gpu layers
- keep model hardcoded for now
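A minimal sketch of the "simple status via /health" check, assuming reqwest and a llama.cpp-style server that exposes a /health endpoint.
#+begin_src rust
// Hypothetical health probe: any HTTP 200 from /health counts as "backend is up".
async fn backend_is_healthy(base_url: &str) -> bool {
    match reqwest::get(format!("{base_url}/health")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}
#+end_src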
** Chat Prompt Template
- simple template defs to get from chat format (with role) to bare text prompt
- collect some default templates (chatml/llama3)
- migrate to /completions api
- apply to specific models ?
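A minimal sketch of turning role-tagged chat messages into a bare text prompt for the /completions API, using the ChatML format as the example; the Message struct is an assumption.
#+begin_src rust
// Hypothetical ChatML rendering from chat history to a single prompt string.
struct Message {
    role: String, // "system" | "user" | "assistant"
    content: String,
}

fn render_chatml(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        prompt.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    // Leave the assistant turn open so the model completes it.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
#+end_src
A llama3 template would be a second function (or a match on the chosen template) with its own markers.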
** Model Selection
- set folder in general settings
- read gguf metadata via gguf crate
- per-model settings (layers? ctx?, vram prediction ?)
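A minimal sketch of listing models from the configured folder by filename and size; GGUF metadata via the gguf crate would be layered on top, and error handling is kept minimal.
#+begin_src rust
// Hypothetical model listing: every *.gguf file in the model dir with its size in bytes.
fn list_models(model_dir: &std::path::Path) -> std::io::Result<Vec<(String, u64)>> {
    let mut models = Vec::new();
    for entry in std::fs::read_dir(model_dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.extension().map_or(false, |ext| ext == "gguf") {
            let name = path
                .file_name()
                .map(|n| n.to_string_lossy().into_owned())
                .unwrap_or_default();
            models.push((name, entry.metadata()?.len()));
        }
    }
    Ok(models)
}
#+end_src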
** Inference settings (in chat as modal or sth like that)
- set sampler params in chat settings
** Settings hierarchy ?
- per_chat > per_model > per_backend > global
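A sketch of how that fallback could resolve a single value, assuming each layer stores optional overrides; all names are illustrative.
#+begin_src rust
// Hypothetical layered lookup: the most specific layer that sets a value wins.
#[derive(Default)]
struct SettingsLayer {
    ngl: Option<u32>,
    context: Option<u32>,
    temperature: Option<f32>,
}

// `layers` ordered most specific first: per_chat, per_model, per_backend, global.
fn resolve<T: Copy>(
    get: impl Fn(&SettingsLayer) -> Option<T>,
    layers: &[&SettingsLayer],
    default: T,
) -> T {
    layers.iter().find_map(|&layer| get(layer)).unwrap_or(default)
}

// e.g. let ngl = resolve(|l| l.ngl, &[&per_chat, &per_model, &per_backend, &global], 32);
#+end_src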
** Setting types ?
- Model loading
- context
- gpu layers
- Sampling
- temperature
- Prompt template
* Settings planning
** Per Backend
*** runner config
- pwd
- cmd
- template for args
- model
- chat template
- inference settings ? (low prio, should switch to other API that allows setting these at runtime)
** Per Model
*** offloading layers ?
** per chat
*** inference settings (runtime)
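Once inference goes through the llama.cpp server's /completion endpoint, runtime sampling settings can be sent per request; a minimal sketch assuming reqwest (with the json feature) and serde_json.
#+begin_src rust
// Hypothetical per-request call carrying runtime sampling settings.
async fn complete(base_url: &str, prompt: &str, temperature: f32) -> reqwest::Result<String> {
    let body = serde_json::json!({
        "prompt": prompt,
        "temperature": temperature,
        "n_predict": 256,
    });
    let resp: serde_json::Value = reqwest::Client::new()
        .post(format!("{base_url}/completion"))
        .json(&body)
        .send()
        .await?
        .json()
        .await?;
    // llama.cpp's /completion returns the generated text in the "content" field.
    Ok(resp["content"].as_str().unwrap_or_default().to_string())
}
#+end_src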
* Settings todo
- start/stop
- start current backend on demand, just start stop on settings page
- disable buttons when backend isn't running
- only allow llama-cpp/llamafile launch arguments for now
* Next steps (teaser)
- [x] finish basic chat
- [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
- [x] edit history + system prompt
- [x] regenerate latest response
# - save history to db (postponed until multichat)
- [ ] backend page
- [ ] infer sampling settings
- [ ] running settings (gpu layer, context size etc)
- [ ] model page
- [ ] set model dir
- [ ] list by simple filename (& size)
- [ ] offline metadata (README frontmatter yaml, filename, (gguf crate))
- [ ] chat settings
- none for now, single model & settings set is selected on respective pages
* Next steps (private mvp)
- chatrooms
- settings/model/etc per chatroom, multiple settings sets
* TODO MVP
- [ ] add test model downloader to nix devshell
- [ ] Backend config via TOML (see the sketch after this list)
- just based on llama.cpp /completion for now
- [ ] Basic chat GUI
- basic ui with bubbles
- advanced ui with markdown rendering
- fix incomplete quotes ?
- [ ] Prompt template & parameters via TOML
- [ ] Basic DB stuff
- single room history
- prompt templates via DB
- parameter management via DB (e.g. temperature)
- [ ] Advanced chat UI
- Multiple "Rooms"
- Set prompt & params per room
- [ ] Basic RAG
- select vector db
- qdrant ? chroma ?
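A sketch of what the TOML-based backend config and prompt/parameter config could look like when loaded with serde, the toml crate, and anyhow; the file layout and field names are assumptions.
#+begin_src rust
use serde::Deserialize;

// Hypothetical config file shape, loaded once at startup.
#[derive(Deserialize)]
struct Config {
    backend: BackendSection,
    prompt: PromptSection,
    params: ParamsSection,
}

#[derive(Deserialize)]
struct BackendSection {
    cmd: String,      // e.g. path to a llamafile or llama.cpp server binary
    model: String,    // path to the .gguf model
    base_url: String, // where /completion will be reachable
}

#[derive(Deserialize)]
struct PromptSection {
    template: String,       // e.g. "chatml" or "llama3"
    system: Option<String>, // optional system prompt
}

#[derive(Deserialize)]
struct ParamsSection {
    temperature: f32,
    n_predict: u32,
}

fn load_config(path: &std::path::Path) -> anyhow::Result<Config> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}
#+end_src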
* TODO Advanced features
- [ ] Backends
- Backend Runner
- llamafile
- llama.cpp nix (via cmd templates ?)
- Backend API config?
- Backend Downloader/Installer
- [ ] Inference Param Templates
- [ ] Prompt Templates
- [ ] model library
- [ ] model downloader
- [ ] model selector
- model data extraction from gguf
- [ ] quant selector
- automatic offloading layer selection based on vram
- [ ] auto-quantize
- vocab selection
- quant checkboxes
- extract progress ETA
- imatrix generation
- dataset downloader ? (or just include a default one?)
- [ ] Better RAG
- [ ] add multiple embedding models
- [ ] add reranking
- [ ] Generic graph based prompt pre/postprocessing via UI, like ComfyUI
- [ ] DSL ? Some existing scripting stuff ?
- [ ] Graph just as visualization, with text-based config
- [ ] Fancy Graph UI
* TODO Polish
- [ ] Backend Multi-API compat, e.g. llama.cpp /completion & /chat/completions
- has different features (chat/completion has hardcoded prompt template)
- support only full featured backends for now
- add chat support here
* TODO Go public
- Rename to YALU ?
- Polish README.md
- Clean history
- Add some more common backends (ollama ?)
- Sync to github
- Announce on /locallama