BLU Discuss list archive
[Discuss] Anyone Played with Programming Local LLMs, Such as Llama?
- Subject: [Discuss] Anyone Played with Programming Local LLMs, Such as Llama?
- From: grg-webvisible+blu at ai.mit.edu (grg)
- Date: Sun, 1 Dec 2024 22:43:04 -0500
- In-reply-to: <d32d2bef-ca94-48c5-9f0b-46b6d6d40a03@borg.org>
- References: <d32d2bef-ca94-48c5-9f0b-46b6d6d40a03@borg.org>
I don't consider myself an expert on what you're asking, but I dabble, so I'll take a crack at it (since nobody else did!). it's a complex and *very* rapidly changing subject, there's a lot to say; broad questions can lead to long answers... (sorry to everyone in advance!)

On Sat, Nov 30, 2024 at 02:09:42PM -0800, Kent Borg wrote:
> I think I should play with the technology from a programming angle to get
> a feel for what they can do and what they can't.

when you say "from a programming angle" do you mean:

* using LLMs to program? (which they can /almost/ do... at the beginning of this year they couldn't at all, but around april they started showing a bit of promise, and now they're scoring 10x-20x what they were at the beginning of the year on programming benchmarks...)

* writing a program which calls LLMs? typically your code will talk to an LLM over the openai api, sending the prompt in a json message and getting the completion back, also in json. there's not much to this interaction... you can send it to someone else's computer or run the model locally; the api doesn't care. (there's a tiny sketch of this a bit further down.)

> I want to understand how their very broad training can be directed in very
> specific ways, such as how can I get them to recognize patterns of my
> choice,

this falls into the category of "fine-tuning", and there are several ways to do it:

* most expensively, you can fine-tune the entire model by incrementally retraining its weights on new data - you'd typically use a lot of data for this, a lot of gpus, and a lot of days or weeks.

* ...or do similar retraining on a well-chosen subset of the weights. still needs a lot of data and gpus, still fairly expensive and slow.

* a much more common method is LoRA (low-rank adaptation), which can be done fast and cheap with even just hundreds of exemplars. people have made google colabs which can train loras for 32-billion-weight models and still fit in the free tier of colab (with free gpus).

* the easiest way, especially when experimenting, falls into the category of "prompt engineering": you give the model example inputs and outputs in every query as context, and its completion is guided by those patterns. last year context windows got long enough to do something reasonable with this, and this summer models started shipping with context windows 5x-10x longer still.

* (more that I'll skip because this is already too long...)

> I think I would like to play with images, but I'm not stubborn in
> that regard. Certainly text is useful, so is audio?

images as input or images as output? those will typically be handled by different sets of models. audio in is a different set again, and audio out is another... there are generalist models which try to take all of those as input (but typically produce only text as output), but the specialist models, not surprisingly, can often do better.

> Googling about (well, duckduckgoing about) I see that Meta's Llama isn't an
> option that is free to play with*. There are other free to use LLMs:
> Granite, Mistral, and Gemma are the ones I have found so far.

??? meta's been open-sourcing their inference code from the start, beginning with llama 1 (last year - ancient history!); they didn't release the weights for llama 1 (though those leaked), but they've released all model weights since llama 2 (also last year). qwen seems to currently have a leading set of open generalist/chat models. deepseek has some good ones too. and also mistral, google's gemma, even cohere's command-r.
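to make that "program which calls LLMs" bullet concrete, here's a rough python sketch - untested, and the host/port/model name are just placeholders (ollama's default port and its openai-compatible endpoint are assumed; substitute whatever server and model you actually run). it also shows the few-shot prompt-engineering trick of packing example input/output pairs into the context:

    import requests

    # assumption: an openai-compatible server listening locally (ollama's
    # default port shown) with a model already pulled; change to match yours.
    API = "http://localhost:11434/v1/chat/completions"
    MODEL = "llama3.1"

    # few-shot "prompt engineering": example input/output pairs go into the
    # context so the completion follows the same pattern.
    messages = [
        {"role": "system", "content": "Label each review as positive or negative. Answer with one word."},
        {"role": "user", "content": "The battery died after two days."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Setup took thirty seconds and it just works."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "The case cracked the first time I dropped it."},
    ]

    resp = requests.post(API, json={"model": MODEL, "messages": messages}, timeout=120)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])   # hopefully: negative

(a paid provider typically takes the same shape of json; you just point at their url, send your api key in an authorization header, and use one of their model names.)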
and there are lots of fine-tunes of these (especially of llama), and many specialist models in each modality & domain. literally over a million models are available for you to download from huggingface and run on your own hardware (or on someone else's!).

> * Pedantic observation: As far as I can tell *none* of the LLMs are open
> source. Sure, the compilable code might be open source but that's not where
> the intellectual property lies. The billions of model parameters are the
> secret sauce, and they are the ultimate opaque blob.

it's accurate that the model weights are the secret sauce, but as above, most of these are open. they're "opaque" in the sense that nobody in the world understands exactly why they do what they do, but you can download the weights for more than a million models, including every one that's been mentioned in this thread so far. the pedants will say that the *training* data and code which produced those weights after $100m worth of compute isn't released, so nobody can reproduce them from scratch (though if I had $100m that's not what I'd spend it on). I think that's true with the sole exception of the allen institute, which has released training data, training code, training logs, fine-tuning code, etc.

> Question: Has anyone here played with writing code to drive LLMs? Any
> pointers for getting out of the mud easily? (Any warnings?)

if what you want to do is see what llms can do - and I know this isn't in line with the ethos of this group (me included) - don't write off paid providers, especially if you don't have beefy gpus. paid queries are usually pennies to fractions of a penny each, and if you're not doing that many this might add up to a dozen dollars a month. open models you self-host have recently gotten very, very good, but unless you have multiple modern gpus you won't be able to run the good versions of them at reasonable speed, and the models you can run won't compare (at least in the chat/generalist domain; in audio I think they will compare, and in image generation I think they will compare but just be slow). there are providers like openrouter and together.ai who will bounce your queries to most providers and even have some good free(-beer) models, which makes it easy. oh, and google gives you 1500 free queries/day on some of their midrange models (50 free/day on their best).

if you are self-hosting, ollama is the easiest way to get started (with the limited models they support), and I'd also have a serious look at jan.ai (don't write it off as a chat gui; it allegedly makes it very easy to download any model and run it with an api port to hit). quantization really helps large models fit into your vram and even dram - you probably want 5-6 bits/weight. complex topic.

"tool use" is a hot topic these days: letting an llm issue commands to a set of pre-cooked functionalities you provide (e.g. web search, fetching the weather or a stock price, or even running bash commands). "rag" (retrieval-augmented generation) is popular too: retrieve data relevant to the query - from the web, a database, a document collection, etc. - and provide it as context along with the query (similar to the prompt engineering I mentioned above).
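for a flavor of rag, here's another rough sketch (same assumed local endpoint and placeholder model as before; the "retriever" is just naive keyword overlap over a few made-up snippets, standing in for a real search index or vector database):

    import requests

    API = "http://localhost:11434/v1/chat/completions"   # assumed local server, as above
    MODEL = "llama3.1"                                    # placeholder model name

    # toy "document collection" -- a real setup would pull chunks from a
    # search index or vector database instead of this hard-coded list.
    docs = [
        "LoRA adds small low-rank adapter matrices so only a sliver of the weights is trained.",
        "Chain-of-thought prompting asks the model to reason step by step before answering.",
        "A model's context window limits how much text can go into a single query.",
    ]

    def retrieve(query, k=2):
        """crude retriever: rank docs by how many words they share with the query."""
        overlap = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
        return sorted(docs, key=overlap, reverse=True)[:k]

    question = "why is lora cheaper than retraining all the weights?"
    context = "\n".join(retrieve(question))

    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"context:\n{context}\n\nquestion: {question}"},
    ]
    resp = requests.post(API, json={"model": MODEL, "messages": messages}, timeout=120)
    print(resp.json()["choices"][0]["message"]["content"])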
imho nobody puts enough effort into the broader form of prompt engineering, which is crafting the instructions you give the llm. everyone asks a one-line question and assumes the llm will know what to do; it doesn't, you have to very carefully guide it. and have llms be verbose: they "think" by generating streams of output, so if you make them be brief you're limiting their "thinking" and output quality suffers. (the current name for this is CoT, chain-of-thought.)

warnings? don't ever believe anything an llm says, no matter how confident it sounds. it lies. a lot. have it cite sources and check them yourself.

> -kb, the Kent who expects he will be using his Framework 13 laptop: a 6-core
> (12-thread) AMD Ryzen 7640U CPU with 64GB of RAM, but maybe he plugs in a
> Hailo M.2 AI module, too.

I don't know anything about the hailo m.2; I briefly looked it up just now, and what I didn't see was any hbm/vram - which makes me question its value (and the stats on their website were from 2021?! different world!). these days llms are memory-bandwidth-bound: how much memory you have in your gpu determines what models you can run at a tolerable speed. your memory bandwidth is great for a cpu but it's 50-100GB/s; the leading gpus that all the serious players use have >1500GB/s and 80GB, and even midrange gamer gpus now have 24GB at ~1000GB/s. you absolutely can run llms cpu-only, but you'll either need to choose smaller models (maybe a few billion weights at most?) or be okay with waiting a long time for a response. I'm unsure whether that hailo module will make any difference; your compute likely outruns your memory already. plus, as above, $200 buys a lot (10k-100k?) of paid queries from the very best models on the very best hardware.

yikes, this is way too long (sorry!), I'll stop now.

--grg