#+title: Plan
* TODO for 0.0.1-rc1
- [X] processmanager service (see the loop sketch at the end of this section)
- [X] spawn task on app startup
- [X] loop every second
- [X] start processes
- [X] query waiting processes from db
- [X] start them
- [X] change their status to running
- [X] stop finished processes in db & remove from RAM registry
- [X] query status for currently running processes
- [X] stop those that aren't status=running
- [X] set their status to finished
- [ ] must have tweaks
- pass options to model (ngl, path & model)
- gpu/nogpu
- model dropdown (ls *.gguf based)
- size
- markdown formatting with markdown-rs + set inner html
- show small backend starter widget icon /button on chat page
- test faster refresh
- chat persistence
- Config.toml
- package as appimage
- add model mode
- amd/rocm/cuda
- [ ] ideas to investigate before release
- stdout inspection
- visualize setting generation ? [not really useful once settings are per chat?]
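
A minimal sketch of the processmanager loop ticked through at the top of this section, assuming a tokio runtime; =Status=, =BackendProcess= and =Db= are illustrative stand-ins for the real sqlite-backed types, so names and the exact reconciliation rules may differ. In the app this would be spawned once with =tokio::spawn= at startup.
#+begin_src rust
use std::collections::HashMap;
use std::time::Duration;
use tokio::process::{Child, Command};

/// Status values assumed from the plan: waiting -> running -> finished.
#[derive(Clone, Copy, PartialEq)]
enum Status { Waiting, Running, Finished }

/// Hypothetical row type; the real rows live in sqlite.
#[derive(Clone)]
struct BackendProcess { id: i64, cmd: String, args: Vec<String>, status: Status }

/// Stand-in for the db layer.
struct Db { rows: Vec<BackendProcess> }

impl Db {
    fn with_status(&self, s: Status) -> Vec<BackendProcess> {
        self.rows.iter().filter(|r| r.status == s).cloned().collect()
    }
    fn set_status(&mut self, id: i64, s: Status) {
        if let Some(r) = self.rows.iter_mut().find(|r| r.id == id) { r.status = s; }
    }
}

/// Loops every second, reconciling the db with the in-RAM registry.
async fn process_manager(mut db: Db) {
    let mut registry: HashMap<i64, Child> = HashMap::new();
    let mut tick = tokio::time::interval(Duration::from_secs(1));
    loop {
        tick.tick().await;
        // start processes that are waiting in the db
        for row in db.with_status(Status::Waiting) {
            if let Ok(child) = Command::new(&row.cmd).args(&row.args).spawn() {
                registry.insert(row.id, child);
                db.set_status(row.id, Status::Running);
            }
        }
        // stop processes that are no longer wanted or have exited,
        // then mark them finished and drop them from the registry
        for id in registry.keys().copied().collect::<Vec<_>>() {
            let still_wanted = db.with_status(Status::Running).iter().any(|r| r.id == id);
            let exited = registry
                .get_mut(&id)
                .map(|c| matches!(c.try_wait(), Ok(Some(_))))
                .unwrap_or(true);
            if !still_wanted || exited {
                if let Some(mut child) = registry.remove(&id) {
                    let _ = child.kill().await;
                }
                db.set_status(id, Status::Finished);
            }
        }
    }
}
#+end_src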
* TODO next steps after 0.0.1-rc1
- markdown formatting
- chat persistence
- backend logs inspector
- multiple chats
- per chat settings/model etc
- configurable ngl
- custom backends via pwd, command & args
- custom backend templates
- prompt templates
- sampling settings
- chat/completion mode?
- transfer planning into issues
* Roadmap
0.1 model selection from dir, switch models
- hardcoded ngl
- llamafile in path or ./llamafile only
- one chat
- simple model selection
- llamafile included templates only
0.2
- hardcoded inbuilt chat templates
- multiple chatrooms
- persist settings
- ngl setting
- persist history
- summaries
- extended backend settings
- max running? running slots?
- better model selection
- extract GGUF metadata
- model downloader ?
- huggingface /api/models hardcoded to my account as owner
- develop some yalu.toml manifest?
- chat templates /completions instead of /chat/completions
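
For the model-downloader idea a few lines up, a minimal sketch of listing models from the Hugging Face hub, assuming its public =/api/models= endpoint accepts an =author= filter and that reqwest is built with the "json" feature; the deserialized field subset is an assumption.
#+begin_src rust
use serde::Deserialize;

/// Assumed subset of the per-model JSON returned by /api/models.
#[derive(Deserialize, Debug)]
struct HfModel {
    /// repo id like "owner/model-name"
    id: String,
}

/// List models for one owner (the hardcoded account would go here).
async fn list_models(owner: &str) -> Result<Vec<HfModel>, reqwest::Error> {
    let url = format!("https://huggingface.co/api/models?author={owner}");
    reqwest::get(url).await?.json::<Vec<HfModel>>().await
}
#+end_src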
* Design for 0.1
- Frontend
- settings page
- model dir
- chat settings drawer
- model selection (from dir */*.gguf?)
- chat template (from hardcoded list)
- start/stop
- Backend
- Settings (1)
- model path
- Chat (1)
- Template
- ModelSettings
- model
- ngl
- BackendProcess (1)
- status: started -> running -> finished
- created from chat & saves its args
- no update, only create&delete
- RunnerBackend
- keep track which processes are running
- start/stop processes when needed
* TODO for 0.1
- Settings api
- #[server] fn update_settings
- model_dir
- Chat Api
- #[server] fn update_chat
- ChatTemplate (llama3, chatml, phi)
- model path
- ngl
- BackendProcess api
- #[server] fn start_process (see sketch below)
- #[server] fn stop_process
- #[server] fn restart_process ?
- BackendRunner worker
- UI stuff
- settings page with model_dir
- drawer on chat
- settings (model_path & ngl)
- start/stop
- Package for private release
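
A minimal sketch of the =start_process= server fn planned above, assuming Leptos 0.5/0.6-style server functions; =db_set_status= is a hypothetical stub for the sqlite write that the BackendRunner worker later picks up.
#+begin_src rust
use leptos::*;

#[server(StartProcess, "/api")]
pub async fn start_process(backend_id: i64) -> Result<(), ServerFnError> {
    // Mark the process as waiting; the BackendRunner worker spawns it
    // on its next tick and flips the status to running.
    db_set_status(backend_id, "waiting")
        .await
        .map_err(|e| ServerFnError::ServerError(e.to_string()))
}

// Hypothetical stub; the real implementation writes to sqlite.
async fn db_set_status(id: i64, status: &str) -> Result<(), std::io::Error> {
    println!("backend {id} -> {status}");
    Ok(())
}
#+end_src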
* TODO Design for backend runners
** TODO
- implement backendconfig CRUD
- backend tab
- implement starting of a specified backendconfig
- "running" tab ?
- add simple per-start settings
- context & ngl
- add model per-start setting
- needs model settings (ie. download path)
- probably need global app settings somewhere
- better message formatting
- markdown conversion
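
A minimal sketch of the markdown conversion (markdown-rs plus setting inner html, as in the 0.0.1-rc1 tweak list), assuming a Leptos component; since the HTML is injected verbatim, untrusted input would still need sanitizing.
#+begin_src rust
use leptos::*;

/// Render one chat message as markdown instead of plain text.
#[component]
pub fn MessageBubble(content: String) -> impl IntoView {
    // markdown::to_html comes from the markdown-rs crate
    let html = markdown::to_html(&content);
    view! { <div class="message-bubble" inner_html=html/> }
}
#+end_src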
** Newest Synthesis
- 2 Resources
- BackendConfig
- includes state needed to start backend
- ie. no runtime options like -ctx -m -ngl etc
- for no-params configs the only UI needed is a select dropdown
- (NO PARAMS !!!!)
- shipped llamafile
- llamafile PATH
- llama.cpp server in PATH ?
- (not mvp)
- basic & flexible pwd, cmd, args(prefix)
- templates for default options (can probably just be in the ui code, auto-filling the form ?)
- llama.cpp path prebuilt
- llama.cpp path builder
- no explicit nix support for now!
- BackendProcess
- initially just start/stop with hardcoded config
- RunTimeConfig
- model
- context etc
** Open Questions
- how to model multiple launched instances ?
- could have different parameters or models loaded
** Synthesis ?
- model backend as resource
- runner can start stop
- build interactor pattern services ?
** (Maybe) better option: runner module separate as a kind of micro subservice
- only startup fn in main, nothing pub apart from that
- server api code stays like a mostly simple crud app
- start background jobs on startup
- starter/manager
- reads intended backend state from sqlite
- has internal state in struct
- makes internal state agree with db
- starts backends
- stops backends
- etc?
- frontend just reads and writes db via server fns
- other background job for having always up-to-date status for backends ?
- expose status checker via backendapi interface trait
** (Maybe) stupid option
- continue current plan, start on demand via server_fn request
- how to handle only starting a single backend
- some in process registry needed ?
* MVP
** Backends
- start on demand
- simple start/stop
- as background service
- simple status via /health (see sketch at the end of this subsection)
- Options
- llamafile
- in $PATH
- as executable file next to binary (enables creating a zip which "just works")
- llama.cpp
- via nix, with a path to the llama.cpp directory
- via path to binary
- Settings
- context
- gpu layers
- keep model hardcoded for now
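
A minimal sketch of the simple status check via /health mentioned above, assuming the backend exposes a llama.cpp-server-style health endpoint and using reqwest; the base URL is a placeholder.
#+begin_src rust
/// Poll the backend's /health endpoint; any non-2xx response or
/// connection error counts as "not running".
async fn backend_is_healthy(base_url: &str) -> bool {
    // base_url is a placeholder, e.g. "http://127.0.0.1:8080"
    match reqwest::get(format!("{base_url}/health")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}
#+end_src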
** Chat Prompt Template
- simple template defs to get from chat format (with role) to bare text prompt
- collect some default templates (chatml/llama3)
- migrate to /completions api
- apply to specific models ?
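
A minimal sketch of such a template def, going from role-based messages to a bare text prompt for the /completions api, with ChatML as the example layout; a llama3 template would follow the same shape with different delimiters.
#+begin_src rust
/// One chat message in role/content form.
struct Message { role: String, content: String }

/// Render a bare ChatML prompt from the chat history.
fn render_chatml(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        // ChatML wraps every turn in <|im_start|>role ... <|im_end|>
        prompt.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    // leave the assistant turn open so the model completes it
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
#+end_src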
** Model Selection
- set folder in general settings
- read gguf metadata via gguf crate
- per-model settings (layers? ctx?, vram prediction ?)
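
A minimal sketch of the folder scan behind model selection, using only std::fs and returning filename-level info (path and size in bytes); GGUF metadata extraction via the gguf crate is left out here.
#+begin_src rust
use std::fs;
use std::path::{Path, PathBuf};

/// List *.gguf files in the configured model dir.
fn list_gguf_models(model_dir: &Path) -> std::io::Result<Vec<(PathBuf, u64)>> {
    let mut models = Vec::new();
    for entry in fs::read_dir(model_dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.extension().and_then(|e| e.to_str()) == Some("gguf") {
            models.push((path, entry.metadata()?.len()));
        }
    }
    Ok(models)
}
#+end_src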
** Inference settings (in chat as modal or sth like that)
- set sampler params in chat settings
** Settings hierarchy ?
- per_chat>per_model>per_backend>global
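
A minimal sketch of that override order, assuming each layer exposes optional values and the most specific layer wins; the field names are illustrative.
#+begin_src rust
/// One settings layer; None means "not set at this level".
#[derive(Default, Clone, Copy)]
struct Layer {
    ngl: Option<u32>,
    ctx: Option<u32>,
    temperature: Option<f32>,
}

/// Walk the layers from most to least specific and take the first hit.
fn resolve<T: Copy>(layers: &[Layer], pick: impl Fn(&Layer) -> Option<T>, default: T) -> T {
    layers.iter().find_map(|l| pick(l)).unwrap_or(default)
}

// usage: resolve(&[per_chat, per_model, per_backend, global], |l| l.ngl, 0)
#+end_src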
** Setting types ?
- Model loading
- context
- gpu layers
- Sampling
- temperature
- Prompt template
* Settings planning
** Per Backend
*** runner config
- pwd
- cmd
- template for args
- model
- chat template
- infer settings? (low prio, should switch to other API that allows setting these at runtime)
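
A minimal sketch of launching a backend from this per-backend runner config (pwd, cmd, args template); the ={model}= placeholder substitution is an assumption about how the args template could work.
#+begin_src rust
use std::process::{Child, Command};

/// Per-backend runner config as planned above.
struct RunnerConfig {
    pwd: String,
    cmd: String,
    /// args template; "{model}" gets replaced at start time
    args: Vec<String>,
}

fn start_backend(cfg: &RunnerConfig, model_path: &str) -> std::io::Result<Child> {
    let args: Vec<String> = cfg
        .args
        .iter()
        .map(|a| a.replace("{model}", model_path))
        .collect();
    Command::new(&cfg.cmd).current_dir(&cfg.pwd).args(&args).spawn()
}
#+end_src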
** Per Model
*** offloading layers ?
** per chat
*** inference settings( runtime )
* Settings todo
- start/stop
- start current backend on demand, just start stop on settings page
- disable buttons when backend isn't running
- only allow llama-cpp/llamafile launch arguments for now
* Next steps (teaser)
- [x] finish basic chat
- [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
- [x] edit history + system prompt
- [x] regenerate latest response
# - save history to db (postponed until multichat)
- [ ] backend page
- [ ] infer sampling settings
- [ ] running settings (gpu layer, context size etc)
- [ ] model page
- [ ] set model dir
- [ ] list by simple filename (& size)
- [ ] offline metadata (README frontmatter yaml, filename, (gguf crate))
- [ ] chat settings
- none for now, single model & settings set is selected on respective pages
* Next steps (private mvp)
- chatrooms
- settings/model/etc per chatroom, multiple settings sets
* TODO MVP
- [ ] add test model downloader to nix devshell
- [ ] Backend config via TOML
- just based on llama.cpp /completion for now
- [ ] Basic chat GUI
- basic ui with bubbles
- advanced ui with markdown rendering
- fix incomplete quotes ?
- [ ] Prompt template & parameters via TOML (see sketch after this list)
- [ ] Basic DB stuff
- single room history
- prompt templates via DB
- parameter management via DB (e.g. temperature)
- [ ] Advanced chat UI
- Multiple "Rooms"
- Set prompt & params per room
- [ ] Basic RAG
- select vector db
- qdrant ? chroma ?
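
A minimal sketch of the prompt template & parameters TOML flagged above, assuming serde plus the toml crate; the keys are illustrative.
#+begin_src rust
use serde::Deserialize;

/// Illustrative shape for a prompt-template + sampling-params file.
#[derive(Deserialize, Debug)]
struct PromptConfig {
    /// e.g. "chatml" or "llama3"
    template: String,
    temperature: f32,
    top_p: Option<f32>,
}

fn parse_prompt_config(text: &str) -> Result<PromptConfig, toml::de::Error> {
    toml::from_str(text)
}

// example file contents:
// template = "chatml"
// temperature = 0.7
// top_p = 0.9
#+end_src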
* TODO Advanced features
- [ ] Backends
- Backend Runner
- llamafile
- llama.cpp nix (via cmd templates ?)
- Backend API config?
- Backend Downloader/Installer
- [ ] Inference Param Templates
- [ ] Prompt Templates
- [ ] model library
- [ ] model downloader
- [ ] model selector
- model data extraction from gguf
- [ ] quant selector
- automatic offloading layer selection based on vram
- [ ] auto-quantize
- vocab selection
- quant checkboxes
- extract progress ETA
- imatrix generation
- dataset downloader ? (or just include a default one?)
- [ ] Better RAG
- [ ] add multiple embedding models
- [ ] add reranking
- [ ] Generic graph based prompt pre/postprocessing via UI, like ComfyUI
- [ ] DSL ? Some existing scripting stuff ?
- [ ] Graph just as visualization, with text-based config
- [ ] Fancy Graph UI
* TODO Polish
- [ ] Backend Multi-API compat e.g. llama.cpp /completion & /chat/completion
- has different features (chat/completion has hardcoded prompt template)
- support only full featured backends for now
- add chat support here
* TODO Go public
- Rename to YALU ?
- Polish README.md
- Clean history
- Add some more common backends (ollama ?)
- Sync to github
- Announce on /locallama