Plan

TODO for 0.0.1-rc1

  • process manager service (see the sketch after this list)

    • spawn task on app startup
    • loop every second
    • start processes

      • query waiting processes from db
      • start them
      • change their status to running
    • stop finished processes in db & remove from RAM registry

      • query status for currently running processes
      • stop those that aren't status=running
      • set their status to finished
  • must-have tweaks

    • pass options to model (ngl, path & model)

      • gpu/nogpu
    • model dropdown (ls *.gguf based)

      • size
    • markdown formatting with markdown-rs + set inner html
    • show small backend starter widget icon/button on chat page
    • test faster refresh
    • chat persistence
    • Config.toml
    • package as appimage
    • add model mode

      • amd/rocm/cuda
  • ideas to investigate before release

    • stdout inspection
    • visualize setting generation ? [not really useful once settings are per chat?]
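
A minimal sketch of the process-manager loop described above, assuming tokio for the 1-second tick and std::process for spawning; the db helpers (waiting_processes, intended_running_ids, mark_running, mark_finished) and the ProcessRegistry shape are placeholders for the real sqlite queries and RAM registry.

#+begin_src rust
// Hypothetical sketch of the process-manager background task.
use std::collections::{HashMap, HashSet};
use std::process::{Child, Command};
use std::time::Duration;

/// RAM registry: processes this instance has spawned (db id -> OS child handle).
#[derive(Default)]
struct ProcessRegistry {
    running: HashMap<i64, Child>,
}

/// Placeholder for a row of the process table.
struct ProcessRow {
    id: i64,
    cmd: String,
    args: Vec<String>,
}

// Stubs standing in for the real sqlite queries/updates.
async fn waiting_processes() -> Vec<ProcessRow> { Vec::new() }
async fn intended_running_ids() -> HashSet<i64> { HashSet::new() }
async fn mark_running(_id: i64) {}
async fn mark_finished(_id: i64) {}

async fn process_manager_loop(mut registry: ProcessRegistry) {
    let mut tick = tokio::time::interval(Duration::from_secs(1));
    loop {
        tick.tick().await;

        // 1. Start everything the db marks as waiting and flip it to running.
        for row in waiting_processes().await {
            match Command::new(&row.cmd).args(&row.args).spawn() {
                Ok(child) => {
                    registry.running.insert(row.id, child);
                    mark_running(row.id).await;
                }
                Err(err) => eprintln!("failed to spawn {}: {err}", row.cmd),
            }
        }

        // 2. Stop children whose db row is no longer status=running (or that
        //    already exited), drop them from the registry, mark them finished.
        let keep = intended_running_ids().await;
        let mut done = Vec::new();
        for (id, child) in registry.running.iter_mut() {
            let exited = matches!(child.try_wait(), Ok(Some(_)));
            if exited || !keep.contains(id) {
                let _ = child.kill(); // ignore errors if it already exited
                done.push(*id);
            }
        }
        for id in done {
            registry.running.remove(&id);
            mark_finished(id).await;
        }
    }
}

#[tokio::main]
async fn main() {
    // In the real app this task would be spawned once on server startup;
    // here it just runs directly.
    process_manager_loop(ProcessRegistry::default()).await;
}
#+end_src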

TODO next steps after 0.0.1-rc1

  • markdown formatting
  • chat persistence
  • backend logs inspector
  • multiple chats
  • per chat settings/model etc
  • configurable ngl
  • custom backends via pwd, command & args
  • custom backend templates
  • prompt templates
  • sampling settings
  • chat/completion mode?
  • transfer planning into issues

Roadmap

0.1 model selection from dir, switch models

  • hardcoded ngl
  • llamafile in path or ./llamafile only
  • one chat
  • simple model selection
  • llamafile included templates only

0.2

  • hardcoded inbuilt chat templates
  • multiple chatrooms

    • persist settings

      • ngl setting
    • persist history
    • summaries
  • extended backend settings

    • max running? running slots?
  • better model selection

    • extract GGUF metadata
  • model downloader ?

    • huggingface /api/models hardcoded to my account as owner
    • develop some yalu.toml manifest?
  • chat templates /completions instead of /chat/completions

Design for 0.1

  • Frontend

    • settings page

      • model dir
    • chat settings drawer

      • model selection (from dir /.gguf?)
      • chat template (from hardcoded list)
      • start/stop
  • Backend (sketched as Rust types after this list)

    • Settings (1)

      • model path
    • Chat (1)

      • Template
      • ModelSettings

        • model
        • ngl
    • BackendProcess (1)

      • status: started -> running -> finished
      • created from chat & saves its args
      • no update, only create & delete
  • RunnerBackend

    • keep track which processes are running
    • start/stop processes when needed
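
Rough Rust shapes for the 0.1 design above; the names follow the bullets, while the field names and exact types are assumptions.

#+begin_src rust
use std::path::PathBuf;

/// Global settings (a single row for 0.1).
struct Settings {
    model_path: PathBuf,
}

/// Hardcoded chat templates for 0.1.
enum ChatTemplate {
    Llama3,
    ChatMl,
    Phi,
}

struct ModelSettings {
    model: PathBuf,
    ngl: u32, // gpu layers to offload
}

/// The single chat in 0.1.
struct Chat {
    template: ChatTemplate,
    model_settings: ModelSettings,
}

/// started -> running -> finished
enum ProcessStatus {
    Started,
    Running,
    Finished,
}

/// Created from a Chat (snapshotting its args); rows are only created and
/// deleted, never edited apart from the status transition.
struct BackendProcess {
    status: ProcessStatus,
    args: Vec<String>,
}
#+end_src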

TODO for 0.1

  • Settings api (see the server fn sketch after this list)

    • #[server] fn update_settings

      • model_dir
  • Chat Api

    • #[server] fn update_chat

      • ChatTemplate (llama3, chatml, phi)
      • model path
      • ngl
  • BackendProcess api

    • #[server] fn start_process
    • #[server] fn stop_process
    • #[server] fn restart_process ?
  • BackendRunner worker
  • UI stuff

    • settings page with model_dir
    • drawer on chat

      • settings (model_path & ngl)
      • start/stop
  • Package for private release
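
A hedged sketch of those server fns, assuming a Leptos #[server] setup; the argument lists mirror the bullets above and the sqlite access is elided.

#+begin_src rust
use leptos::*;

#[server]
pub async fn update_settings(model_dir: String) -> Result<(), ServerFnError> {
    // Persist the model dir in the settings table (db access elided).
    let _ = model_dir;
    Ok(())
}

#[server]
pub async fn update_chat(
    template: String, // "llama3" | "chatml" | "phi"
    model_path: String,
    ngl: u32,
) -> Result<(), ServerFnError> {
    let _ = (template, model_path, ngl);
    Ok(())
}

#[server]
pub async fn start_process(chat_id: i64) -> Result<(), ServerFnError> {
    // Insert a BackendProcess row; the BackendRunner worker picks it up.
    let _ = chat_id;
    Ok(())
}

#[server]
pub async fn stop_process(process_id: i64) -> Result<(), ServerFnError> {
    // Flip the row's status so the worker stops the child process.
    let _ = process_id;
    Ok(())
}
#+end_src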

TODO Design for backend runners

TODO

  • implement backendconfig CRUD

    • backend tab
  • implement starting of a specified backendconfig

    • "running" tab ?
  • add simple per-start settings

    • context & ngl
  • add model per-start setting

    • needs model settings (i.e. download path)
    • probably need global app settings somewhere
  • better message formatting

    • markdown conversion

Newest Synthesis

  • 2 Resources (sketched as Rust types after this list)

    • BackendConfig

      • includes state needed to start backend
      • i.e. no runtime options like -ctx/-m/-ngl etc.
      • for no-params configs the only UI needed is a select dropdown

        • (NO PARAMS !!!!)

          • shipped llamafile
          • llamafile PATH
          • llama.cpp server in PATH ?
        • (not mvp)

          • basic & flexible pwd, cmd, args(prefix)
          • templates for default options (can probably just be in the ui code, auto-filling the form ?)

            • llama.cpp path prebuilt
            • llama.cpp path builder
            • no explicit nix support for now!
    • BackendProcess

      • initially just start/stop with a hardcoded config
    • RunTimeConfig

      • model
      • context etc
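
One possible Rust encoding of the two resources plus the runtime config; the variants mirror the bullets above, everything else is an assumption.

#+begin_src rust
use std::path::PathBuf;

/// State needed to start a backend; no runtime options like ctx/model/ngl.
enum BackendConfig {
    // no-params variants: a plain select dropdown is enough UI
    ShippedLlamafile,
    LlamafileInPath,
    LlamaCppServerInPath,
    // not MVP: fully flexible launcher
    Custom {
        pwd: PathBuf,
        cmd: String,
        args_prefix: Vec<String>,
    },
}

/// Initially just start/stop with a hardcoded config.
struct BackendProcess {
    config: BackendConfig,
    pid: Option<u32>,
}

/// Chosen per launch rather than per backend.
struct RunTimeConfig {
    model: PathBuf,
    context: u32,
}
#+end_src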

Open Questions

  • how to model multiple launched instances ?

    • could have different parameters or models loaded

Synthesis ?

  • model backend as resource

    • runner can start stop
  • build interactor pattern services ?

(Maybe) better option: runner module separate as a kind of micro subservice

  • only startup fn in main, nothing pub apart from that
  • server api code stays like a mostly simple crud app
  • start background jobs on startup

    • starter/manager

      • reads intended backend state from sqlite
      • has internal state in struct
      • makes internal state agree with db

        • starts backends
        • stops backends
        • etc?
  • frontend just reads and writes db via server fns
  • another background job for keeping backend status always up to date?

    • expose status checker via backendapi interface trait
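
A minimal shape for that status-checker trait (the name and variants are assumptions, not an existing API):

#+begin_src rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackendStatus {
    Starting,
    Running,
    Stopped,
    Failed,
}

trait BackendApi {
    /// Poll the backend (e.g. its /health endpoint) and report its status.
    async fn status(&self) -> BackendStatus;
}
#+end_src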

(Maybe) stupid option

  • continue current plan, start on demand via server_fn request
  • how to handle only starting a single backend

    • some in process registry needed ?

MVP

Backends

  • start on demand

    • simple start/stop

      • as background service
    • simple status via /health (see the sketch after this list)
  • Options

    • llamafile

      • in $PATH
      • as executable file next to the binary (enables creating a zip which "just works")
    • llama.cpp

      • via nix, using a path to the llama.cpp directory
      • via path to binary
  • Settings

    • context
    • gpu layers
    • keep model hardcoded for now
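
The /health poll could be as small as this sketch, assuming a llama.cpp-style backend that exposes GET /health and that reqwest + tokio are available.

#+begin_src rust
/// Returns true if the backend answers /health with a 2xx status.
async fn backend_is_healthy(base_url: &str) -> bool {
    match reqwest::get(format!("{base_url}/health")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}

#[tokio::main]
async fn main() {
    // Example: backend launched locally on the default llama.cpp server port.
    let healthy = backend_is_healthy("http://127.0.0.1:8080").await;
    println!("backend healthy: {healthy}");
}
#+end_src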

Chat Prompt Template

  • simple template defs to get from chat format (with role) to bare text prompt (see the sketch after this list)

    • collect some default templates (chatml/llama3)
  • migrate to /completions api
  • apply to specific models ?
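
A sketch of such a template def for the ChatML format (a Llama 3 template would follow the same shape); the Message type is a placeholder.

#+begin_src rust
struct Message {
    role: String, // "system" | "user" | "assistant"
    content: String,
}

/// Render messages as a ChatML prompt ready for a bare /completions request.
fn chatml_prompt(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        prompt.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    // Leave the assistant turn open so the model completes it.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}

fn main() {
    let prompt = chatml_prompt(&[
        Message { role: "system".into(), content: "You are a helpful assistant.".into() },
        Message { role: "user".into(), content: "Hello!".into() },
    ]);
    println!("{prompt}");
}
#+end_src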

Model Selection

  • set folder in general settings
  • read gguf metadata via gguf crate (see the header-reading sketch after this list)
  • per-model settings (layers? ctx?, vram prediction ?)
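
A hedged sketch of the metadata step: rather than assuming a specific gguf crate API, this reads only the fixed GGUF header (magic, version, tensor count, metadata kv count) for the *.gguf files in a hypothetical ./models dir.

#+begin_src rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read the fixed-size GGUF header; returns None if the magic doesn't match.
fn read_gguf_header(path: &Path) -> std::io::Result<Option<GgufHeader>> {
    let mut buf = [0u8; 4 + 4 + 8 + 8];
    File::open(path)?.read_exact(&mut buf)?;
    if &buf[0..4] != b"GGUF" {
        return Ok(None);
    }
    Ok(Some(GgufHeader {
        version: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
        tensor_count: u64::from_le_bytes(buf[8..16].try_into().unwrap()),
        metadata_kv_count: u64::from_le_bytes(buf[16..24].try_into().unwrap()),
    }))
}

fn main() -> std::io::Result<()> {
    // List *.gguf files in the (hypothetical) model dir set in general settings.
    for entry in std::fs::read_dir("./models")? {
        let path = entry?.path();
        if path.extension().map_or(false, |e| e == "gguf") {
            if let Some(h) = read_gguf_header(&path)? {
                println!(
                    "{}: gguf v{} ({} tensors, {} metadata keys)",
                    path.display(), h.version, h.tensor_count, h.metadata_kv_count
                );
            }
        }
    }
    Ok(())
}
#+end_src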

Inference settings (in chat as modal or sth like that)

  • set sampler params in chat settings

Settings hierarchy ?

  • per_chat > per_model > per_backend > global
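
A minimal sketch of that lookup order, assuming every layer except the global default is optional.

#+begin_src rust
struct Layered<T> {
    per_chat: Option<T>,
    per_model: Option<T>,
    per_backend: Option<T>,
    global: T,
}

impl<T: Clone> Layered<T> {
    /// First value found wins: per_chat > per_model > per_backend > global.
    fn resolve(&self) -> T {
        self.per_chat
            .clone()
            .or_else(|| self.per_model.clone())
            .or_else(|| self.per_backend.clone())
            .unwrap_or_else(|| self.global.clone())
    }
}

fn main() {
    // Example: gpu layers set globally and overridden per model.
    let ngl = Layered { per_chat: None, per_model: Some(24), per_backend: None, global: 0 };
    assert_eq!(ngl.resolve(), 24);
}
#+end_src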

Setting types ?

  • Model loading

    • context
    • gpu layers
  • Sampling

    • temperature
  • Prompt template

Settings planning

Per Backend

runner config (see the sketch after this list)

  • pwd
  • cmd
  • template for args

    • model
    • chat template
    • inference settings? (low prio; should switch to another API that allows setting these at runtime)
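
A sketch of how such a runner config could turn into a process invocation; the {model} placeholder syntax and the llama-server binary name are assumptions, not the real config format.

#+begin_src rust
use std::process::Command;

struct RunnerConfig {
    pwd: String,
    cmd: String,
    args: Vec<String>, // may contain "{model}" placeholders
}

fn build_command(cfg: &RunnerConfig, model_path: &str) -> Command {
    let mut command = Command::new(&cfg.cmd);
    command.current_dir(&cfg.pwd);
    for arg in &cfg.args {
        command.arg(arg.replace("{model}", model_path));
    }
    command
}

fn main() {
    let cfg = RunnerConfig {
        pwd: ".".into(),
        cmd: "llama-server".into(),
        args: vec!["-m".into(), "{model}".into(), "-ngl".into(), "24".into()],
    };
    let command = build_command(&cfg, "./models/example.gguf");
    println!("{command:?}"); // .spawn() would actually launch the backend
}
#+end_src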

Per Model

offloading layers ?

per chat

inference settings (runtime)

Settings todo

  • start/stop

    • start current backend on demand, just start/stop on settings page
    • disable buttons when backend isn't running
  • only allow llama-cpp/llamafile launch arguments for now

Next steps (teaser)

  • [x] finish basic chat

    • [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
    • [x] edit history + system prompt
    • [x] regenerate latest response
  • backend page

    • infer sampling settings
    • running settings (gpu layer, context size etc)
  • model page

    • set model dir
    • list by simple filename (& size)
    • offline metadata (README frontmatter yaml, filename, (gguf crate))
  • chat settings

      • none for now, a single model & settings set is selected on the respective pages

Next steps (private mvp)

  • chatrooms
  • settings/model/etc. per chatroom, multiple settings sets

TODO MVP

  • add test model downloader to nix devshell
  • Backend config via TOML (see the sketch after this list)

    • just based on llama.cpp /completion for now
  • Basic chat GUI

    • basic ui with bubbles
    • advanced ui with markdown rendering

      • fix incomplete quotes ?
  • Prompt template & parameters via TOML
  • Basic DB stuff

    • single room history
    • prompt templates via DB
    • parameter management via DB (e.g. temperature)
  • Advanced chat UI

    • Multiple "Rooms"
    • Set prompt & params per room
  • Basic RAG

    • select vector db

      • qdrant ? chroma ?
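
A hedged sketch of the TOML-backed backend config, assuming serde and the toml crate; the schema is a guess based on the runner-config notes above, not an agreed format.

#+begin_src rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct BackendConfigToml {
    pwd: Option<String>,
    cmd: String,
    args: Vec<String>,
}

fn main() {
    // Example config as it might live in a backend.toml file.
    let raw = r#"
        cmd = "llama-server"
        args = ["-m", "model.gguf", "-ngl", "24"]
    "#;
    let cfg: BackendConfigToml = toml::from_str(raw).expect("invalid backend config");
    println!("{cfg:?}");
}
#+end_src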

TODO Advanced features

  • Backends

    • Backend Runner

      • llamafile
      • llama.cpp nix (via cmd templates ?)
    • Backend API config?
    • Backend Downloader/Installer
  • Inference Param Templates
  • Prompt Templates
  • model library

    • model downloader
    • model selector

      • model data extraction from gguf
    • quant selector

      • automatic offloading layer selection based on vram
    • auto-quantize

      • vocab selection
      • quant checkboxes
      • extract progress ETA
      • imatrix generation
      • dataset downloader ? (or just include a default one?)
  • Better RAG

    • add multiple embedding models
    • add reranking
  • Generic graph based prompt pre/postprocessing via UI, like ComfyUI

    • DSL ? Some existing scripting stuff ?
    • Graph just as visualization, with text-based config
    • Fancy Graph UI

TODO Polish

  • Backend Multi-API compat e.g. llama.cpp /completion & /chat/completion

    • they have different features (chat/completion has a hardcoded prompt template)
    • support only full featured backends for now
    • add chat support here

TODO Go public

  • Rename to YALU ?
  • Polish README.md
  • Clean history
  • Add some more common backends (ollama ?)
  • Sync to github
  • Announce on /locallama