Plan
- for 0.0.1-rc1
- next steps after 0.0.1-rc1
- Roadmap
- Design for 0.1
- for 0.1
- Design for backend runners
- MVP
- Settings planning
- Settings todo
- Next steps (teaser)
- Next steps (private mvp)
- MVP
- Advanced features
- Polish
- Go public
TODO for 0.0.1-rc1
- processmanager service
  - spawn task on app startup
  - loop every second
  - start processes
    - query waiting processes from db
    - start them
    - change their status to running
  - stop finished processes in db & remove from RAM registry
    - query status for currently running processes
    - stop those that aren't status=running
    - set their status to finished
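A hedged sketch of that loop, assuming tokio + sqlx over sqlite; the `backend_process` table, its columns, and the llamafile command are illustrative assumptions, not the actual schema.

```rust
use std::collections::HashMap;
use std::process::{Child, Command};
use std::time::Duration;

pub async fn process_manager(pool: sqlx::SqlitePool) {
    // RAM registry of children we spawned, keyed by db id
    let mut running: HashMap<i64, Child> = HashMap::new();
    loop {
        // start processes: query waiting rows, spawn them, mark them running
        let waiting: Vec<(i64, String)> =
            sqlx::query_as("SELECT id, args FROM backend_process WHERE status = 'waiting'")
                .fetch_all(&pool)
                .await
                .unwrap_or_default();
        for (id, args) in waiting {
            // "llamafile" is a placeholder command for this sketch
            if let Ok(child) = Command::new("llamafile").args(args.split_whitespace()).spawn() {
                running.insert(id, child);
                let _ = sqlx::query("UPDATE backend_process SET status = 'running' WHERE id = ?")
                    .bind(id)
                    .execute(&pool)
                    .await;
            }
        }
        // stop finished processes: check db status for everything in the RAM
        // registry, kill children whose row is no longer status = 'running',
        // then mark them finished and drop them from the registry
        let mut done = Vec::new();
        for (id, child) in running.iter_mut() {
            let still_running: Option<(String,)> = sqlx::query_as(
                "SELECT status FROM backend_process WHERE id = ? AND status = 'running'",
            )
            .bind(*id)
            .fetch_optional(&pool)
            .await
            .unwrap_or(None);
            if still_running.is_none() {
                let _ = child.kill();
                done.push(*id);
            }
        }
        for id in done {
            running.remove(&id);
            let _ = sqlx::query("UPDATE backend_process SET status = 'finished' WHERE id = ?")
                .bind(id)
                .execute(&pool)
                .await;
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```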
- must have tweaks
  - pass options to model (ngl, path & model)
    - gpu/nogpu
  - model dropdown (ls *.gguf based)
    - size
  - markdown formatting with markdown-rs + set inner html (sketch below)
  - show small backend starter widget icon/button on chat page
  - test faster refresh
  - chat persistence
  - Config.toml
  - package as appimage
  - add model mode
    - amd/rocm/cuda
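A minimal sketch of the markdown item above: markdown-rs is published as the `markdown` crate, and the returned HTML is what gets set as inner html on the chat bubble.

```rust
// Convert a chat message to HTML; the frontend sets the result as the
// bubble's inner HTML instead of rendering plain text.
fn render_message_html(raw: &str) -> String {
    markdown::to_html(raw)
}
```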
- ideas to investigate before release
  - stdout inspection
  - visualize setting generation? [not really useful once settings are per chat?]
TODO next steps after 0.0.1-rc1
- markdown formatting
- chat persistence
- backend logs inspector
- multiple chats
- per chat settings/model etc
- configurable ngl
- custom backends via pwd, command & args
- custom backend templates
- prompt templates
- sampling settings
- chat/completion mode?
- transfer planning into issues
Roadmap
0.1: model selection from dir, switch models
- hardcoded ngl
- llamafile in path or ./llamafile only
- one chat
- simple model selection
- llamafile included templates only
0.2
- hardcoded inbuilt chat templates
- multiple chatrooms
- persist settings
  - ngl setting
  - persist history
  - summaries
- extended backend settings
  - max running? running slots?
- better model selection
  - extract GGUF metadata
- model downloader ?
  - huggingface /api/models hardcoded to my account as owner (sketch below)
  - develop some yalu.toml manifest ?
- chat templates /completions instead of /chat/completions
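A hedged sketch of that hardcoded downloader idea, listing one owner's repos via the Hugging Face hub API with reqwest (json feature) and serde; the `author` query parameter and the response shape are assumptions to verify against the hub API docs.

```rust
// Assumed endpoint: GET https://huggingface.co/api/models?author=<owner>,
// returning a JSON array of repos that include at least an "id" field.
#[derive(serde::Deserialize)]
struct HubModel {
    id: String, // e.g. "<owner>/<repo>"
}

async fn list_owner_models(owner: &str) -> Result<Vec<HubModel>, reqwest::Error> {
    let url = format!("https://huggingface.co/api/models?author={owner}");
    reqwest::get(&url).await?.json::<Vec<HubModel>>().await
}
```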
Design for 0.1
- Frontend
  - settings page
    - model dir
  - chat settings drawer
    - model selection (from dir /.gguf?)
    - chat template (from hardcoded list)
    - start/stop
- Backend
  - Settings (1)
    - model path
  - Chat (1)
    - Template
  - ModelSettings
    - model
    - ngl
  - BackendProcess (1)
    - status: started -> running -> finished
    - created from chat & saves its args
    - no update, only create & delete
  - RunnerBackend
    - keep track of which processes are running
    - start/stop processes when needed
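A rough sketch of the 0.1 backend data model as Rust types; names follow the outline above, field types and representation are assumptions.

```rust
enum ProcessStatus {
    Started,
    Running,
    Finished,
}

// Settings (1): single global row
struct Settings {
    model_path: std::path::PathBuf,
}

// Chat (1): one chat for 0.1, only the prompt template is stored
struct Chat {
    template: String, // "llama3" | "chatml" | "phi"
}

struct ModelSettings {
    model: std::path::PathBuf,
    ngl: u32, // GPU layers
}

// Created from the chat with a snapshot of its args; never updated, only
// created and deleted. RunnerBackend holds the live process handles.
struct BackendProcess {
    status: ProcessStatus,
    args: Vec<String>,
}
```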
TODO for 0.1
- Settings api
  - #[server] fn update_settings (sketch below)
    - model_dir
- Chat Api
  - #[server] fn update_chat
    - ChatTemplate (llama3, chatml, phi)
    - model path
    - ngl
- BackendProcess api
  - #[server] fn start_process
  - #[server] fn stop_process
  - #[server] fn restart_process ?
  - BackendRunner worker
- UI stuff
  - settings page with model_dir
  - drawer on chat
    - settings (model_path & ngl)
    - start/stop
- Package for private release
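A minimal sketch of that API layer, assuming Leptos-style `#[server]` functions (0.7-style imports) with a sqlite pool provided via context; signatures, table names, and error handling are assumptions.

```rust
use leptos::prelude::*;

#[server]
pub async fn update_settings(model_dir: String) -> Result<(), ServerFnError> {
    // persist the global model dir; nothing is started here
    let pool = expect_context::<sqlx::SqlitePool>();
    sqlx::query("UPDATE settings SET model_dir = ? WHERE id = 1")
        .bind(model_dir)
        .execute(&pool)
        .await
        .map_err(|e| ServerFnError::new(e.to_string()))?;
    Ok(())
}

#[server]
pub async fn start_process(chat_id: i64) -> Result<(), ServerFnError> {
    // only records the intent; the BackendRunner worker picks it up on its next tick
    let pool = expect_context::<sqlx::SqlitePool>();
    sqlx::query("INSERT INTO backend_process (chat_id, status) VALUES (?, 'waiting')")
        .bind(chat_id)
        .execute(&pool)
        .await
        .map_err(|e| ServerFnError::new(e.to_string()))?;
    Ok(())
}
```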
TODO Design for backend runners
TODO
- implement backendconfig CRUD
  - backend tab
- implement starting of a specified backendconfig
  - "running" tab ?
- add simple per-start settings
  - context & ngl
- add model per-start setting
  - needs model settings (i.e. download path)
  - probably need global app settings somewhere
- better message formatting
  - markdown conversion
Newest Synthesis
- 2 Resources (sketch below)
- BackendConfig
  - includes state needed to start backend
  - i.e. no runtime options like -ctx -m -ngl etc.
  - for no-params configs the only UI needed is a select dropdown
  - (NO PARAMS !!!!)
    - shipped llamafile
    - llamafile in PATH
    - llama.cpp server in PATH ?
  - (not mvp)
    - basic & flexible pwd, cmd, args (prefix)
    - templates for default options (can probably just be in the ui code, auto-filling the form ?)
      - llama.cpp path prebuilt
      - llama.cpp path builder
      - no explicit nix support for now!
- BackendProcess
  - initially just start/stop with hardcoded config
- RunTimeConfig
  - model
  - context etc.
- Open Questions
  - how to model multiple launched instances ?
    - could have different parameters or models loaded
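A sketch of the two resources plus the per-start runtime config as Rust types; variant and field names are assumptions following the notes above.

```rust
// "No params" variants need nothing but a select dropdown in the UI;
// Custom is the flexible, post-MVP pwd/cmd/args form.
enum BackendConfig {
    ShippedLlamafile,
    LlamafileInPath,
    LlamaCppServerInPath,
    Custom {
        pwd: std::path::PathBuf,
        cmd: String,
        args_prefix: Vec<String>,
    },
}

// Everything BackendConfig deliberately excludes: runtime options chosen per start.
struct RunTimeConfig {
    model: std::path::PathBuf,
    ctx_size: u32,
    ngl: u32,
}

// Initially just start/stop with a hardcoded config.
struct BackendProcess {
    config: BackendConfig,
    runtime: RunTimeConfig,
    status: String, // "waiting" | "running" | "finished"
}
```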
Synthesis ?
- model backend as resource
  - runner can start/stop
  - build interactor pattern services ?
(Maybe) better option: runner module separate as a kind of micro subservice (sketch below)
- only startup fn in main, nothing pub apart from that
- server api code stays a mostly simple crud app
- start background jobs on startup
  - starter/manager
    - reads intended backend state from sqlite
    - has internal state in struct
    - makes internal state agree with db
      - starts backends
      - stops backends
      - etc?
- frontend just reads and writes db via server fns
- other background job for having always up-to-date status for backends ?
  - expose status checker via backendapi interface trait
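A sketch of that micro-subservice shape: the runner module exposes only a startup function, keeps its registry private, and reconciles sqlite intent with running children. The names are assumptions, and the reconcile body would be the loop sketched further up.

```rust
mod runner {
    use std::collections::HashMap;
    use std::process::Child;
    use std::time::Duration;

    // in-process state; never leaves this module
    struct Registry {
        running: HashMap<i64, Child>,
    }

    /// The only public item: called once from main at startup.
    pub fn start(pool: sqlx::SqlitePool) {
        tokio::spawn(async move {
            let mut registry = Registry { running: HashMap::new() };
            loop {
                // make internal state agree with the db: start/stop backends
                reconcile(&pool, &mut registry).await;
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        });
    }

    async fn reconcile(_pool: &sqlx::SqlitePool, _registry: &mut Registry) {
        // elided: query intended state, spawn/kill children, update statuses
    }
}
```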
(Maybe) stupid option
- continue current plan, start on demand via server_fn request
- how to handle only starting a single backend?
  - some in-process registry needed ?
MVP
Backends
- start on demand
- simple start/stop
  - as background service
  - simple status via /health
- Options
  - llamafile
    - in $PATH
    - as executable file next to binary (enables creating a zip which "just works")
  - llama.cpp
    - via nix via path to llama.cpp directory
    - via path to binary
- Settings
  - context
  - gpu layers
  - keep model hardcoded for now
Chat Prompt Template
- simple template defs to get from chat format (with role) to bare text prompt
  - collect some default templates (chatml/llama3)
  - migrate to /completions api
  - apply to specific models ?
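A minimal sketch of such a template def, using ChatML; the llama3 variant would be another function or enum arm, and the resulting string goes to the /completions-style API as a bare prompt.

```rust
struct Message {
    role: String, // "system" | "user" | "assistant"
    content: String,
}

// ChatML: wrap every turn in <|im_start|>role ... <|im_end|> and leave the
// final assistant turn open so the model completes it.
fn chatml_prompt(history: &[Message]) -> String {
    let mut prompt = String::new();
    for m in history {
        prompt.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
```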
Model Selection
- set folder in general settings
- read gguf metadata via gguf crate
- per-model settings (layers? ctx?, vram prediction ?)
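Until the gguf crate is wired in, a minimal sketch of what the metadata read involves, based on the documented GGUF header layout; the kv pairs (architecture, context length, etc.) follow after this fixed-size header.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

// Reads only the 24-byte GGUF header: b"GGUF" magic, u32 version,
// u64 tensor_count, u64 metadata_kv_count (all little endian).
fn gguf_header(path: &Path) -> std::io::Result<(u32, u64, u64)> {
    let mut buf = [0u8; 24];
    File::open(path)?.read_exact(&mut buf)?;
    assert_eq!(&buf[0..4], b"GGUF", "not a GGUF file");
    let version = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(buf[16..24].try_into().unwrap());
    Ok((version, tensor_count, metadata_kv_count))
}
```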
Inference settings (in chat as modal or sth like that)
- set sampler params in chat settings
Settings hierarchy ?
- per_chat>per_model>per_backend>global
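One way the per_chat > per_model > per_backend > global chain could resolve, sketched with Option fields where the most specific Some wins; field names are assumptions.

```rust
#[derive(Clone, Default)]
struct SettingsLayer {
    ctx_size: Option<u32>,
    gpu_layers: Option<u32>,
    temperature: Option<f32>,
}

// layers ordered most specific first: [per_chat, per_model, per_backend, global]
fn resolve(layers: &[SettingsLayer]) -> SettingsLayer {
    let mut out = SettingsLayer::default();
    for layer in layers {
        out.ctx_size = out.ctx_size.or(layer.ctx_size);
        out.gpu_layers = out.gpu_layers.or(layer.gpu_layers);
        out.temperature = out.temperature.or(layer.temperature);
    }
    out
}
```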
Setting types ?
- Model loading
  - context
  - gpu layers
- Sampling
  - temperature
  - Prompt template
Settings planning
Per Backend
- runner config
  - pwd
  - cmd
  - template for args (sketch below)
    - model
    - chat template
    - infer settings ? (low prio, should switch to the other API that allows setting these at runtime)
Per Model
- offloading layers ?
Per Chat
- inference settings (runtime)
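A hedged sketch of the args template: stored per backend and filled per start with the chosen model; placeholder syntax and names are assumptions.

```rust
// Args template stored per backend, e.g. ["-m", "{model}", "-c", "{ctx}"];
// rendered per start. Inference settings stay out of here and move to the
// runtime API later, as noted above.
fn render_args(args_template: &[String], model: &str, ctx: u32) -> Vec<String> {
    args_template
        .iter()
        .map(|a| a.replace("{model}", model).replace("{ctx}", &ctx.to_string()))
        .collect()
}
```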
Settings todo
- start/stop
  - start current backend on demand, just start/stop on settings page
  - disable buttons when backend isn't running
- only allow llama-cpp/llamafile launch arguments for now
Next steps (teaser)
- [x] finish basic chat
  - [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
  - [x] edit history + system prompt
  - [x] regenerate latest response
- backend page
  - infer sampling settings
  - running settings (gpu layers, context size etc.)
- model page (sketch below)
  - set model dir
  - list by simple filename (& size)
  - offline metadata (README frontmatter yaml, filename, (gguf crate))
- chat settings
  - none for now, single model & settings set is selected on respective pages
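A small sketch for the model page listing: scan the configured model dir for *.gguf files and return filename plus size.

```rust
use std::path::{Path, PathBuf};

// Returns (path, size in bytes) for every *.gguf file directly in the model dir.
fn list_models(model_dir: &Path) -> std::io::Result<Vec<(PathBuf, u64)>> {
    let mut models = Vec::new();
    for entry in std::fs::read_dir(model_dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.extension().is_some_and(|ext| ext == "gguf") {
            models.push((path, entry.metadata()?.len()));
        }
    }
    models.sort();
    Ok(models)
}
```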
Next steps (private mvp)
- chatrooms
- settings/model/etc per chatroom, multiple settings sets
TODO MVP
- add test model downloader to nix devshell
- Backend config via TOML
  - just based on llama.cpp /completion for now (sketch below)
- Basic chat GUI
  - basic ui with bubbles
  - advanced ui with markdown rendering
    - fix incomplete quotes ?
- Prompt template & parameters via TOML
- Basic DB stuff
  - single room history
  - prompt templates via DB
  - parameter management via DB (e.g. temperature)
- Advanced chat UI
  - Multiple "Rooms"
  - Set prompt & params per room
- Basic RAG
  - select vector db
    - qdrant ? chroma ?
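A hedged sketch of driving llama.cpp's /completion endpoint (reqwest with the json feature assumed); only the documented prompt / n_predict / temperature request fields and the content response field are used, and the base URL is an assumption.

```rust
#[derive(serde::Serialize)]
struct CompletionRequest<'a> {
    prompt: &'a str, // bare text prompt, already run through the chat template
    n_predict: u32,
    temperature: f32,
}

#[derive(serde::Deserialize)]
struct CompletionResponse {
    content: String, // generated text
}

async fn complete(base_url: &str, prompt: &str) -> Result<String, reqwest::Error> {
    let req = CompletionRequest { prompt, n_predict: 256, temperature: 0.7 };
    let resp: CompletionResponse = reqwest::Client::new()
        .post(format!("{base_url}/completion"))
        .json(&req)
        .send()
        .await?
        .json()
        .await?;
    Ok(resp.content)
}
```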
TODO Advanced features
- Backends
  - Backend Runner
    - llamafile
    - llama.cpp nix (via cmd templates ?)
  - Backend API config?
  - Backend Downloader/Installer
- Inference Param Templates
- Prompt Templates
- model library
  - model downloader
- model selector
  - model data extraction from gguf
- quant selector
  - automatic offloading layer selection based on vram
- auto-quantize
  - vocab selection
  - quant checkboxes
  - extract progress ETA
  - imatrix generation
  - dataset downloader ? (or just include a default one?)
- Better RAG
  - add multiple embedding models
  - add reranking
- Generic graph based prompt pre/postprocessing via UI, like ComfyUI
  - DSL ? Some existing scripting stuff ?
  - Graph just as visualization, with text-based config
  - Fancy Graph UI
TODO Polish
- Backend multi-API compat, e.g. llama.cpp /completion & /chat/completions
  - these have different features (/chat/completions has a hardcoded prompt template)
  - support only full-featured backends for now
  - add chat support here
TODO Go public
- Rename to YALU ?
- Polish README.md
- Clean history
- Add some more common backends (ollama ?)
- Sync to github
- Announce on /locallama