#+title: Plan
* TODO for 0.0.1-rc1
- [X] processmanager service (see the loop sketch after this list)
- [X] spawn task on app startup
- [X] loop every second
- [X] start processes
- [X] query waiting processes from db
- [X] start them
- [X] change their status to running
- [X] stop finished processes in db & remove from RAM registry
- [X] query status for currently running processes
- [X] stop those that aren't status=running
- [X] set their status to finished
- [ ] must have tweaks
- pass options to model (ngl, path & model)
- gpu/nogpu
- model dropdown (ls *.gguf based)
- size
- markdown formatting with markdown-rs + set inner html
- show small backend starter widget icon/button on chat page
- test faster refresh
- chat persistence
- Config.toml
- package as appimage
- add model mode
- amd/rocm/cuda
- [ ] ideas to investigate before release
- stdout inspection
- visualize setting generation ? [not really useful once settings are per chat?]
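A minimal sketch of the processmanager loop described above, assuming tokio for the background task and sqlx for the SQLite access; table and column names are illustrative, not the actual schema.
#+begin_src rust
// Hypothetical sketch: spawned on app startup, reconciles DB state with running
// processes once per second.
use std::collections::HashMap;
use std::process::Child;
use std::time::Duration;
use sqlx::SqlitePool;

pub fn spawn_process_manager(pool: SqlitePool) {
    tokio::spawn(async move {
        // In-RAM registry of processes we started, keyed by their DB id.
        let mut registry: HashMap<i64, Child> = HashMap::new();
        loop {
            // 1. Query waiting processes from the DB, start them, mark them running.
            let waiting: Vec<(i64, String)> =
                sqlx::query_as("SELECT id, cmd FROM processes WHERE status = 'waiting'")
                    .fetch_all(&pool)
                    .await
                    .unwrap_or_default();
            for (id, cmd) in waiting {
                if let Ok(child) = std::process::Command::new(cmd).spawn() {
                    registry.insert(id, child);
                    let _ = sqlx::query("UPDATE processes SET status = 'running' WHERE id = ?")
                        .bind(id)
                        .execute(&pool)
                        .await;
                }
            }
            // 2. Reap exited processes: set status to finished, drop from RAM registry.
            let mut exited = Vec::new();
            for (id, child) in registry.iter_mut() {
                if matches!(child.try_wait(), Ok(Some(_))) {
                    exited.push(*id);
                }
            }
            for id in exited {
                registry.remove(&id);
                let _ = sqlx::query("UPDATE processes SET status = 'finished' WHERE id = ?")
                    .bind(id)
                    .execute(&pool)
                    .await;
            }
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });
}
#+end_src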
* TODO next steps after 0.0.1-rc1
- markdown formatting
- chat persistence
- backend logs inspector
- multiple chats
- per chat settings/model etc
- configurable ngl
- custom backends via pwd, command & args
- custom backend templates
- prompt templates
- sampling settings
- chat/completion mode?
- transfer planning into issues
* Roadmap
0.1: model selection from dir, switch models
- hardcoded ngl
- llamafile in path or ./llamafile only
- one chat
- simple model selection
- llamafile included templates only
0.2
- hardcoded inbuilt chat templates
- multiple chatrooms
- persist settings
- ngl setting
- persist history
- summaries
- extended backend settings
- max running? running slots?
- better model selection
- extract GGUF metadata
- model downloader ?
- huggingface /api/models hardcoded to my account as owner
- develop some yalu.toml manifest?
- chat templates via /completions instead of /chat/completions
* Design for 0.1
- Frontend
- settings page
- model dir
- chat settings drawer
- model selection (from dir */*.gguf?)
- chat template (from hardcoded list)
- start/stop
- Backend
- Settings (1)
- model path
- Chat (1)
- Template
- ModelSettings
- model
- ngl
- BackendProcess (1)
- status: started -> running -> finished
- created from chat & saves its args
- no update, only create&delete
- RunnerBackend
- keep track which processes are running
- start/stop processes when needed
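A rough Rust sketch of the 0.1 data model described above; the names and field types are assumptions, not the final schema.
#+begin_src rust
// Hypothetical shapes for the 0.1 design.
struct Settings {
    model_dir: String, // single global settings row
}

enum ChatTemplate {
    Llama3,
    ChatML,
    Phi,
}

struct ModelSettings {
    model: String, // path to the .gguf file
    ngl: u32,      // GPU layers to offload
}

struct Chat {
    template: ChatTemplate,
    model_settings: ModelSettings,
}

// Created from a Chat and snapshots its args; no update, only create & delete.
struct BackendProcess {
    args: Vec<String>,
    status: ProcessStatus,
}

enum ProcessStatus {
    Started,
    Running,
    Finished,
}
#+end_src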
* TODO for 0.1
- Settings api (see the server fn sketch after this list)
- #[server] fn update_settings
- model_dir
- Chat Api
- #[server] fn update_chat
- ChatTemplate (llama3, chatml, phi)
- model path
- ngl
- BackendProcess api
- #[server] fn start_process
- #[server] fn stop_process
- #[server] fn restart_process ?
- BackendRunner worker
- UI stuff
- settings page with model_dir
- drawer on chat
- settings (model_path & ngl)
- start/stop
- Package for private release
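A rough sketch of one of the #[server] functions listed above, assuming Leptos server functions with an sqlx pool provided via context; the route, table layout, and error handling are placeholders.
#+begin_src rust
use leptos::*;

// Hypothetical settings endpoint: persists the model_dir set on the settings page.
#[server(UpdateSettings, "/api")]
pub async fn update_settings(model_dir: String) -> Result<(), ServerFnError> {
    let pool = use_context::<sqlx::SqlitePool>()
        .ok_or_else(|| ServerFnError::ServerError("db pool missing from context".into()))?;
    sqlx::query("UPDATE settings SET model_dir = ? WHERE id = 1")
        .bind(model_dir)
        .execute(&pool)
        .await
        .map_err(|e| ServerFnError::ServerError(e.to_string()))?;
    Ok(())
}
#+end_src
start_process / stop_process would follow the same pattern, writing the intended status to the DB so the BackendRunner worker can pick it up.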
* TODO Design for backend runners
** TODO
- implement backendconfig CRUD
- backend tab
- implement starting of a specified backendconfig
- "running" tab ?
- add simple per-start settings
- context & ngl
- add model per-start setting
- needs model settings (ie. download path)
- probably need global app settings somewhere
- better message formatting
- markdown conversion
** Newest Synthesis
- 2 Resources
- BackendConfig
- includes state needed to start backend
- i.e. no runtime options like -ctx -m -ngl etc
- for no-params configs the only UI needed is a select dropdown
- (NO PARAMS !!!!)
- shipped llamafile
- llamafile PATH
- llama.cpp server in PATH ?
- (not mvp)
- basic & flexible pwd, cmd, args(prefix)
- templates for default options (can probably just be in the ui code, auto-filling the form ?)
- llama.cpp path prebuilt
- llama.cpp path builder
- no explicit nix support for now!
- BackendProcess
- initially just start/stop with hardcoded config
- RunTimeConfig
- model
- context etc
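One possible Rust shape for the two resources described above; the variants mirror the options listed, and all names are assumptions.
#+begin_src rust
// Hypothetical BackendConfig: only the state needed to start a backend,
// no runtime options like -m / -ctx / -ngl (those live in RunTimeConfig).
enum BackendConfig {
    ShippedLlamafile,     // llamafile bundled with the app
    LlamafileInPath,      // llamafile found via $PATH
    LlamaCppServerInPath, // llama.cpp server binary found via $PATH
    // not MVP: basic & flexible custom backend
    Custom {
        pwd: std::path::PathBuf,
        cmd: String,
        args_prefix: Vec<String>,
    },
}

// Hypothetical per-start options, combined with a BackendConfig at launch time.
struct RunTimeConfig {
    model: std::path::PathBuf,
    context: u32,
    ngl: u32,
}
#+end_src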
** Open Questions
- how to model multiple launched instances ?
- could have different parameters or models loaded
** Synthesis ?
- model backend as resource
- runner can start stop
- build interactor pattern services ?
** (Maybe) better option: runner module separate as a kind of micro subservice
- only startup fn in main, nothing pub apart from that
- server api code stays like a mostly simple crud app
- start background jobs on startup
- starter/manager
- reads intended backend state from sqlite
- has internal state in struct
- makes internal state agree with db
- starts backends
- stops backends
- etc?
- frontend just reads and writes db via server fns
- other background job for having always up-to-date status for backends ?
- expose status checker via backendapi interface trait
** (Maybe) stupid option
- continue current plan, start on demand via server_fn request
- how to handle only starting a single backend
- some in process registry needed ?
* MVP
** Backends
- start on demand
- simple start/stop
- as background service
- simple status via /health
- Options
- llamafile
- in $PATH
- as an executable file next to the binary (enables creating a zip which "just works")
- llama.cpp
- via nix via path to llama.cpp directory
- via path to binary
- Settings
- context
- gpu layers
- keep model hardcoded for now
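A minimal sketch of the "simple status via /health" check, assuming reqwest and a llama.cpp-style server that exposes a /health endpoint.
#+begin_src rust
// Hypothetical health probe: any HTTP 200 from /health counts as "backend is up".
async fn backend_is_healthy(base_url: &str) -> bool {
    match reqwest::get(format!("{base_url}/health")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}
#+end_src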
** Chat Prompt Template
- simple template defs to get from chat format (with role) to bare text prompt
- collect some default templates (chatml/llama3)
- migrate to /completions api
- apply to specific models ?
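A minimal sketch of turning role-tagged chat messages into a bare text prompt for the /completions API, using the ChatML format as the example; the Message struct is an assumption.
#+begin_src rust
// Hypothetical ChatML rendering from chat history to a single prompt string.
struct Message {
    role: String, // "system" | "user" | "assistant"
    content: String,
}

fn render_chatml(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        prompt.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    // Leave the assistant turn open so the model completes it.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
#+end_src
A llama3 template would be a second function (or a match on the chosen template) with its own markers.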
** Model Selection
- set folder in general settings
- read gguf metadata via gguf crate
- per-model settings (layers? ctx?, vram prediction ?)
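A minimal sketch of listing models from the configured folder by filename and size; GGUF metadata via the gguf crate would be layered on top, and error handling is kept minimal.
#+begin_src rust
// Hypothetical model listing: every *.gguf file in the model dir with its size in bytes.
fn list_models(model_dir: &std::path::Path) -> std::io::Result<Vec<(String, u64)>> {
    let mut models = Vec::new();
    for entry in std::fs::read_dir(model_dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.extension().map_or(false, |ext| ext == "gguf") {
            let name = path
                .file_name()
                .map(|n| n.to_string_lossy().into_owned())
                .unwrap_or_default();
            models.push((name, entry.metadata()?.len()));
        }
    }
    Ok(models)
}
#+end_src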
** Inference settings (in chat as modal or sth like that)
- set sampler params in chat settings
** Settings hierarchy ?
- per_chat > per_model > per_backend > global
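A sketch of how that fallback could resolve a single value, assuming each layer stores optional overrides; all names are illustrative.
#+begin_src rust
// Hypothetical layered lookup: the most specific layer that sets a value wins.
#[derive(Default)]
struct SettingsLayer {
    ngl: Option<u32>,
    context: Option<u32>,
    temperature: Option<f32>,
}

// `layers` ordered most specific first: per_chat, per_model, per_backend, global.
fn resolve<T: Copy>(
    get: impl Fn(&SettingsLayer) -> Option<T>,
    layers: &[&SettingsLayer],
    default: T,
) -> T {
    layers.iter().find_map(|&layer| get(layer)).unwrap_or(default)
}

// e.g. let ngl = resolve(|l| l.ngl, &[&per_chat, &per_model, &per_backend, &global], 32);
#+end_src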
** Setting types ?
- Model loading
- context
- gpu layers
- Sampling
- temperature
- Prompt template
* Settings planning
** Per Backend
*** runner config
- pwd
- cmd
- template for args
- model
- chat template
- inference settings ? (low prio, should switch to other API that allows setting these at runtime)
** Per Model
*** offloading layers ?
** per chat
*** inference settings (runtime)
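Once inference goes through the llama.cpp server's /completion endpoint, runtime sampling settings can be sent per request; a minimal sketch assuming reqwest (with the json feature) and serde_json.
#+begin_src rust
// Hypothetical per-request call carrying runtime sampling settings.
async fn complete(base_url: &str, prompt: &str, temperature: f32) -> reqwest::Result<String> {
    let body = serde_json::json!({
        "prompt": prompt,
        "temperature": temperature,
        "n_predict": 256,
    });
    let resp: serde_json::Value = reqwest::Client::new()
        .post(format!("{base_url}/completion"))
        .json(&body)
        .send()
        .await?
        .json()
        .await?;
    // llama.cpp's /completion returns the generated text in the "content" field.
    Ok(resp["content"].as_str().unwrap_or_default().to_string())
}
#+end_src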
* Settings todo
- start/stop
- start current backend on demand, just start stop on settings page
- disable buttons when backend isn't running
- only allow llama-cpp/llamafile launch arguments for now
* Next steps (teaser)
- [x] finish basic chat
- [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
- [x] edit history + system prompt
- [x] regenerate latest response
# - save history to db (postponed until multichat)
- [ ] backend page
- [ ] infer sampling settings
- [ ] running settings (gpu layer, context size etc)
- [ ] model page
- [ ] set model dir
- [ ] list by simple filename (& size)
- [ ] offline metadata (README frontmatter yaml, filename, (gguf crate))
- [ ] chat settings
- none for now, single model & settings set is selected on respective pages
* Next steps (private mvp)
- chatrooms
- settings/model/etc per chatroom, multiple settings sets
* TODO MVP
- [ ] add test model downloader to nix devshell
- [ ] Backend config via TOML (see the sketch after this list)
- just based on llama.cpp /completion for now
- [ ] Basic chat GUI
- basic ui with bubbles
- advanced ui with markdown rendering
- fix incomplete quotes ?
- [ ] Prompt template & parameters via TOML
- [ ] Basic DB stuff
- single room history
- prompt templates via DB
- parameter management via DB (e.g. temperature)
- [ ] Advanced chat UI
- Multiple "Rooms"
- Set prompt & params per room
- [ ] Basic RAG
- select vector db
- qdrant ? chroma ?
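A sketch of what the TOML-based backend config and prompt/parameter config could look like when loaded with serde, the toml crate, and anyhow; the file layout and field names are assumptions.
#+begin_src rust
use serde::Deserialize;

// Hypothetical config file shape, loaded once at startup.
#[derive(Deserialize)]
struct Config {
    backend: BackendSection,
    prompt: PromptSection,
    params: ParamsSection,
}

#[derive(Deserialize)]
struct BackendSection {
    cmd: String,      // e.g. path to a llamafile or llama.cpp server binary
    model: String,    // path to the .gguf model
    base_url: String, // where /completion will be reachable
}

#[derive(Deserialize)]
struct PromptSection {
    template: String,       // e.g. "chatml" or "llama3"
    system: Option<String>, // optional system prompt
}

#[derive(Deserialize)]
struct ParamsSection {
    temperature: f32,
    n_predict: u32,
}

fn load_config(path: &std::path::Path) -> anyhow::Result<Config> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}
#+end_src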
* TODO Advanced features
- [ ] Backends
- Backend Runner
- llamafile
- llama.cpp nix (via cmd templates ?)
- Backend API config?
- Backend Downloader/Installer
- [ ] Inference Param Templates
- [ ] Prompt Templates
- [ ] model library
- [ ] model downloader
- [ ] model selector
- model data extraction from gguf
- [ ] quant selector
- automatic offloading layer selection based on vram
- [ ] auto-quantize
- vocab selection
- quant checkboxes
- extract progress ETA
- imatrix generation
- dataset downloader ? (or just include a default one?)
- [ ] Better RAG
- [ ] add multiple embedding models
- [ ] add reranking
- [ ] Generic graph based prompt pre/postprocessing via UI, like ComfyUI
- [ ] DSL ? Some existing scripting stuff ?
- [ ] Graph just as visualization, with text-based config
- [ ] Fancy Graph UI
* TODO Polish
- [ ] Backend Multi-API compat, e.g. llama.cpp /completion & /chat/completions
- has different features (chat/completion has hardcoded prompt template)
- support only full featured backends for now
- add chat support here
* TODO Go public
- Rename to YALU ?
- Polish README.md
- Clean history
- Add some more common backends (ollama ?)
- Sync to github
- Announce on /locallama