Plan

TODO for 0.0.1-rc1

  • process manager service (see the sketch after this list)

    • spawn task on app startup
    • loop every second
    • start processes

      • query waiting processes from db
      • start them
      • change their status to running
    • stop finished processes in db & remove from RAM registry

      • query status for currently running processes
      • stop those that aren't status=running
      • set their status to finished
  • must-have tweaks

    • pass options to model (ngl, path & model)

      • gpu/nogpu
    • model dropdown (ls *.gguf based)

      • size
    • markdown formatting with markdown-rs + set inner html
    • show small backend starter widget icon/button on chat page
    • test faster refresh
    • chat persistence
    • Config.toml
    • package as appimage
    • add model mode

      • amd/rocm/cuda
  • ideas to investigate before release

    • stdout inspection
    • visualize setting generation ? [not really useful once settings are per chat?]
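
A minimal sketch of the process-manager loop described above, assuming tokio for the 1-second tick and std::process for spawning; the db helpers (waiting_processes, intended_running_ids, mark_running, mark_finished) and the ProcessRegistry shape are placeholders for the real sqlite queries and RAM registry.

#+begin_src rust
// Hypothetical sketch of the process-manager background task.
use std::collections::{HashMap, HashSet};
use std::process::{Child, Command};
use std::time::Duration;

/// RAM registry: processes this instance has spawned (db id -> OS child handle).
#[derive(Default)]
struct ProcessRegistry {
    running: HashMap<i64, Child>,
}

/// Placeholder for a row of the process table.
struct ProcessRow {
    id: i64,
    cmd: String,
    args: Vec<String>,
}

// Stubs standing in for the real sqlite queries/updates.
async fn waiting_processes() -> Vec<ProcessRow> { Vec::new() }
async fn intended_running_ids() -> HashSet<i64> { HashSet::new() }
async fn mark_running(_id: i64) {}
async fn mark_finished(_id: i64) {}

async fn process_manager_loop(mut registry: ProcessRegistry) {
    let mut tick = tokio::time::interval(Duration::from_secs(1));
    loop {
        tick.tick().await;

        // 1. Start everything the db marks as waiting and flip it to running.
        for row in waiting_processes().await {
            match Command::new(&row.cmd).args(&row.args).spawn() {
                Ok(child) => {
                    registry.running.insert(row.id, child);
                    mark_running(row.id).await;
                }
                Err(err) => eprintln!("failed to spawn {}: {err}", row.cmd),
            }
        }

        // 2. Stop children whose db row is no longer status=running (or that
        //    already exited), drop them from the registry, mark them finished.
        let keep = intended_running_ids().await;
        let mut done = Vec::new();
        for (id, child) in registry.running.iter_mut() {
            let exited = matches!(child.try_wait(), Ok(Some(_)));
            if exited || !keep.contains(id) {
                let _ = child.kill(); // ignore errors if it already exited
                done.push(*id);
            }
        }
        for id in done {
            registry.running.remove(&id);
            mark_finished(id).await;
        }
    }
}

#[tokio::main]
async fn main() {
    // In the real app this task would be spawned once on server startup;
    // here it just runs directly.
    process_manager_loop(ProcessRegistry::default()).await;
}
#+end_src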

TODO next steps after 0.0.1-rc1

  • markdown formatting
  • chat persistence
  • backend logs inspector
  • multiple chats
  • per chat settings/model etc
  • configurable ngl
  • custom backends via pwd, command & args
  • custom backend templates
  • prompt templates
  • sampling settings
  • chat/completion mode?
  • transfer planning into issues

Roadmap

0.1 model selection from dir, switch models

  • hardcoded ngl
  • llamafile in path or ./llamafile only
  • one chat
  • simple model selection
  • llamafile included templates only

0.2

  • hardcoded inbuilt chat templates
  • multiple chatrooms

    • persist settings

      • ngl setting
    • persist history
    • summaries
  • extended backend settings

    • max running? running slots?
  • better model selection

    • extract GGUF metadata
  • model downloader ?

    • huggingface /api/models hardcoded to my account as owner
    • develop some yalu.toml manifest?
  • chat templates /completions instead of /chat/completions

Design for 0.1

  • Frontend

    • settings page

      • model dir
    • chat settings drawer

      • model selection (from dir /.gguf?)
      • chat template (from hardcoded list)
      • start/stop
  • Backend (sketched as Rust types after this list)

    • Settings (1)

      • model path
    • Chat (1)

      • Template
      • ModelSettings

        • model
        • ngl
    • BackendProcess (1)

      • status: started -> running -> finished
      • created from chat & saves its args
      • no update, only create & delete
  • RunnerBackend

    • keep track which processes are running
    • start/stop processes when needed
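
Rough Rust shapes for the 0.1 design above; the names follow the bullets, while the field names and exact types are assumptions.

#+begin_src rust
use std::path::PathBuf;

/// Global settings (a single row for 0.1).
struct Settings {
    model_path: PathBuf,
}

/// Hardcoded chat templates for 0.1.
enum ChatTemplate {
    Llama3,
    ChatMl,
    Phi,
}

struct ModelSettings {
    model: PathBuf,
    ngl: u32, // gpu layers to offload
}

/// The single chat in 0.1.
struct Chat {
    template: ChatTemplate,
    model_settings: ModelSettings,
}

/// started -> running -> finished
enum ProcessStatus {
    Started,
    Running,
    Finished,
}

/// Created from a Chat (snapshotting its args); rows are only created and
/// deleted, never edited apart from the status transition.
struct BackendProcess {
    status: ProcessStatus,
    args: Vec<String>,
}
#+end_src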

TODO for 0.1

  • Settings api (see the server fn sketch after this list)

    • #[server] fn update_settings

      • model_dir
  • Chat Api

    • #[server] fn update_chat

      • ChatTemplate (llama3, chatml, phi)
      • model path
      • ngl
  • BackendProcess api

    • #[server] fn start_process
    • #[server] fn stop_process
    • #[server] fn restart_process ?
  • BackendRunner worker
  • UI stuff

    • settings page with model_dir
    • drawer on chat

      • settings (model_path & ngl)
      • start/stop
  • Package for private release
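
A hedged sketch of those server fns, assuming a Leptos #[server] setup; the argument lists mirror the bullets above and the sqlite access is elided.

#+begin_src rust
use leptos::*;

#[server]
pub async fn update_settings(model_dir: String) -> Result<(), ServerFnError> {
    // Persist the model dir in the settings table (db access elided).
    let _ = model_dir;
    Ok(())
}

#[server]
pub async fn update_chat(
    template: String, // "llama3" | "chatml" | "phi"
    model_path: String,
    ngl: u32,
) -> Result<(), ServerFnError> {
    let _ = (template, model_path, ngl);
    Ok(())
}

#[server]
pub async fn start_process(chat_id: i64) -> Result<(), ServerFnError> {
    // Insert a BackendProcess row; the BackendRunner worker picks it up.
    let _ = chat_id;
    Ok(())
}

#[server]
pub async fn stop_process(process_id: i64) -> Result<(), ServerFnError> {
    // Flip the row's status so the worker stops the child process.
    let _ = process_id;
    Ok(())
}
#+end_src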

TODO Design for backend runners

TODO

  • implement backendconfig CRUD

    • backend tab
  • implement starting of a specified backendconfig

    • "running" tab ?
  • add simple per-start settings

    • context & ngl
  • add model per-start setting

    • needs model settings (i.e. download path)
    • probably need global app settings somewhere
  • better message formatting

    • markdown conversion

Newest Synthesis

  • 2 Resources (sketched as Rust types after this list)

    • BackendConfig

      • includes state needed to start backend
      • i.e. no runtime options like -ctx/-m/-ngl etc.
      • for no-params configs the only UI needed is a select dropdown

        • (NO PARAMS !!!!)

          • shipped llamafile
          • llamafile PATH
          • llama.cpp server in PATH ?
        • (not mvp)

          • basic & flexible pwd, cmd, args(prefix)
          • templates for default options (can probably just be in the ui code, auto-filling the form ?)

            • llama.cpp path prebuilt
            • llama.cpp path builder
            • no explicit nix support for now!
    • BackendProcess

      • initially just start/stop with a hardcoded config
    • RunTimeConfig

      • model
      • context etc
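
One possible Rust encoding of the two resources plus the runtime config; the variants mirror the bullets above, everything else is an assumption.

#+begin_src rust
use std::path::PathBuf;

/// State needed to start a backend; no runtime options like ctx/model/ngl.
enum BackendConfig {
    // no-params variants: a plain select dropdown is enough UI
    ShippedLlamafile,
    LlamafileInPath,
    LlamaCppServerInPath,
    // not MVP: fully flexible launcher
    Custom {
        pwd: PathBuf,
        cmd: String,
        args_prefix: Vec<String>,
    },
}

/// Initially just start/stop with a hardcoded config.
struct BackendProcess {
    config: BackendConfig,
    pid: Option<u32>,
}

/// Chosen per launch rather than per backend.
struct RunTimeConfig {
    model: PathBuf,
    context: u32,
}
#+end_src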

Open Questions

  • how to model multiple launched instances ?

    • could have different parameters or models loaded

Synthesis ?

  • model backend as resource

    • runner can start stop
  • build interactor pattern services ?

(Maybe) better option: runner module separate as a kind of micro subservice

  • only startup fn in main, nothing pub apart from that
  • server api code stays like a mostly simple crud app
  • start background jobs on startup

    • starter/manager

      • reads intended backend state from sqlite
      • has internal state in struct
      • makes internal state agree with db

        • starts backends
        • stops backends
        • etc?
  • frontend just reads and writes db via server fns
  • another background job for keeping backend status always up to date?

    • expose status checker via backendapi interface trait
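
A minimal shape for that status-checker trait (the name and variants are assumptions, not an existing API):

#+begin_src rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackendStatus {
    Starting,
    Running,
    Stopped,
    Failed,
}

trait BackendApi {
    /// Poll the backend (e.g. its /health endpoint) and report its status.
    async fn status(&self) -> BackendStatus;
}
#+end_src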

(Maybe) stupid option

  • continue current plan, start on demand via server_fn request
  • how to handle only starting a single backend

    • some in process registry needed ?

MVP

Backends

  • start on demand

    • simple start/stop

      • as background service
    • simple status via /health (see the sketch after this list)
  • Options

    • llamafile

      • in $PATH
      • as executable file next to the binary (enables creating a zip which "just works")
    • llama.cpp

      • via nix, using a path to the llama.cpp directory
      • via path to binary
  • Settings

    • context
    • gpu layers
    • keep model hardcoded for now
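
The /health poll could be as small as this sketch, assuming a llama.cpp-style backend that exposes GET /health and that reqwest + tokio are available.

#+begin_src rust
/// Returns true if the backend answers /health with a 2xx status.
async fn backend_is_healthy(base_url: &str) -> bool {
    match reqwest::get(format!("{base_url}/health")).await {
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}

#[tokio::main]
async fn main() {
    // Example: backend launched locally on the default llama.cpp server port.
    let healthy = backend_is_healthy("http://127.0.0.1:8080").await;
    println!("backend healthy: {healthy}");
}
#+end_src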

Chat Prompt Template

  • simple template defs to get from chat format (with role) to bare text prompt (see the sketch after this list)

    • collect some default templates (chatml/llama3)
  • migrate to /completions api
  • apply to specific models ?
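
A sketch of such a template def for the ChatML format (a Llama 3 template would follow the same shape); the Message type is a placeholder.

#+begin_src rust
struct Message {
    role: String, // "system" | "user" | "assistant"
    content: String,
}

/// Render messages as a ChatML prompt ready for a bare /completions request.
fn chatml_prompt(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        prompt.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    // Leave the assistant turn open so the model completes it.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}

fn main() {
    let prompt = chatml_prompt(&[
        Message { role: "system".into(), content: "You are a helpful assistant.".into() },
        Message { role: "user".into(), content: "Hello!".into() },
    ]);
    println!("{prompt}");
}
#+end_src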

Model Selection

  • set folder in general settings
  • read gguf metadata via gguf crate (see the header-reading sketch after this list)
  • per-model settings (layers? ctx?, vram prediction ?)
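
A hedged sketch of the metadata step: rather than assuming a specific gguf crate API, this reads only the fixed GGUF header (magic, version, tensor count, metadata kv count) for the *.gguf files in a hypothetical ./models dir.

#+begin_src rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read the fixed-size GGUF header; returns None if the magic doesn't match.
fn read_gguf_header(path: &Path) -> std::io::Result<Option<GgufHeader>> {
    let mut buf = [0u8; 4 + 4 + 8 + 8];
    File::open(path)?.read_exact(&mut buf)?;
    if &buf[0..4] != b"GGUF" {
        return Ok(None);
    }
    Ok(Some(GgufHeader {
        version: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
        tensor_count: u64::from_le_bytes(buf[8..16].try_into().unwrap()),
        metadata_kv_count: u64::from_le_bytes(buf[16..24].try_into().unwrap()),
    }))
}

fn main() -> std::io::Result<()> {
    // List *.gguf files in the (hypothetical) model dir set in general settings.
    for entry in std::fs::read_dir("./models")? {
        let path = entry?.path();
        if path.extension().map_or(false, |e| e == "gguf") {
            if let Some(h) = read_gguf_header(&path)? {
                println!(
                    "{}: gguf v{} ({} tensors, {} metadata keys)",
                    path.display(), h.version, h.tensor_count, h.metadata_kv_count
                );
            }
        }
    }
    Ok(())
}
#+end_src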

Inference settings (in chat as modal or sth like that)

  • set sampler params in chat settings

Settings hierarchy ?

  • per_chat > per_model > per_backend > global
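
A minimal sketch of that lookup order, assuming every layer except the global default is optional.

#+begin_src rust
struct Layered<T> {
    per_chat: Option<T>,
    per_model: Option<T>,
    per_backend: Option<T>,
    global: T,
}

impl<T: Clone> Layered<T> {
    /// First value found wins: per_chat > per_model > per_backend > global.
    fn resolve(&self) -> T {
        self.per_chat
            .clone()
            .or_else(|| self.per_model.clone())
            .or_else(|| self.per_backend.clone())
            .unwrap_or_else(|| self.global.clone())
    }
}

fn main() {
    // Example: gpu layers set globally and overridden per model.
    let ngl = Layered { per_chat: None, per_model: Some(24), per_backend: None, global: 0 };
    assert_eq!(ngl.resolve(), 24);
}
#+end_src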

Setting types ?

  • Model loading

    • context
    • gpu layers
  • Sampling

    • temperature
  • Prompt template

Settings planning

Per Backend

runner config (see the sketch after this list)

  • pwd
  • cmd
  • template for args

    • model
    • chat template
    • inference settings? (low prio; should switch to another API that allows setting these at runtime)
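
A sketch of how such a runner config could turn into a process invocation; the {model} placeholder syntax and the llama-server binary name are assumptions, not the real config format.

#+begin_src rust
use std::process::Command;

struct RunnerConfig {
    pwd: String,
    cmd: String,
    args: Vec<String>, // may contain "{model}" placeholders
}

fn build_command(cfg: &RunnerConfig, model_path: &str) -> Command {
    let mut command = Command::new(&cfg.cmd);
    command.current_dir(&cfg.pwd);
    for arg in &cfg.args {
        command.arg(arg.replace("{model}", model_path));
    }
    command
}

fn main() {
    let cfg = RunnerConfig {
        pwd: ".".into(),
        cmd: "llama-server".into(),
        args: vec!["-m".into(), "{model}".into(), "-ngl".into(), "24".into()],
    };
    let command = build_command(&cfg, "./models/example.gguf");
    println!("{command:?}"); // .spawn() would actually launch the backend
}
#+end_src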

Per Model

offloading layers ?

per chat

inference settings (runtime)

Settings todo

  • start/stop

    • start current backend on demand, just start/stop on settings page
    • disable buttons when backend isn't running
  • only allow llama-cpp/llamafile launch arguments for now

Next steps (teaser)

  • [x] finish basic chat

    • [x] bigger bubbles (use screen, +flex grow?/maybe even grid?)
    • [x] edit history + system prompt
    • [x] regenerate latest response
  • backend page

    • infer sampling settings
    • running settings (gpu layer, context size etc)
  • model page

    • set model dir
    • list by simple filename (& size)
    • offline metadata (README frontmatter yaml, filename, (gguf crate))
  • chat settings

      • none for now, a single model & settings set is selected on the respective pages

Next steps (private mvp)

  • chatrooms
  • settings/model/etc. per chatroom, multiple settings sets

TODO MVP

  • add test model downloader to nix devshell
  • Backend config via TOML (see the sketch after this list)

    • just based on llama.cpp /completion for now
  • Basic chat GUI

    • basic ui with bubbles
    • advanced ui with markdown rendering

      • fix incomplete quotes ?
  • Prompt template & parameters via TOML
  • Basic DB stuff

    • single room history
    • prompt templates via DB
    • parameter management via DB (e.g. temperature)
  • Advanced chat UI

    • Multiple "Rooms"
    • Set prompt & params per room
  • Basic RAG

    • select vector db

      • qdrant ? chroma ?
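
A hedged sketch of the TOML-backed backend config, assuming serde and the toml crate; the schema is a guess based on the runner-config notes above, not an agreed format.

#+begin_src rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct BackendConfigToml {
    pwd: Option<String>,
    cmd: String,
    args: Vec<String>,
}

fn main() {
    // Example config as it might live in a backend.toml file.
    let raw = r#"
        cmd = "llama-server"
        args = ["-m", "model.gguf", "-ngl", "24"]
    "#;
    let cfg: BackendConfigToml = toml::from_str(raw).expect("invalid backend config");
    println!("{cfg:?}");
}
#+end_src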

TODO Advanced features

  • Backends

    • Backend Runner

      • llamafile
      • llama.cpp nix (via cmd templates ?)
    • Backend API config?
    • Backend Downloader/Installer
  • Inference Param Templates
  • Prompt Templates
  • model library

    • model downloader
    • model selector

      • model data extraction from gguf
    • quant selector

      • automatic offloading layer selection based on vram
    • auto-quantize

      • vocab selection
      • quant checkboxes
      • extract progress ETA
      • imatrix generation
      • dataset downloader ? (or just include a default one?)
  • Better RAG

    • add multiple embedding models
    • add reranking
  • Generic graph based prompt pre/postprocessing via UI, like ComfyUI

    • DSL ? Some existing scripting stuff ?
    • Graph just as visualization, with text-based config
    • Fancy Graph UI

TODO Polish

  • Backend Multi-API compat e.g. llama.cpp /completion & /chat/completion

    • they have different features (chat/completion has a hardcoded prompt template)
    • support only full featured backends for now
    • add chat support here

TODO Go public

  • Rename to YALU ?
  • Polish README.md
  • Clean history
  • Add some more common backends (ollama ?)
  • Sync to github
  • Announce on /locallama