bambax an hour ago

> I decided to explore self-hosting some of my non-critical applications

Self-hosting static or almost-static websites is now really easy with a Cloudflare front. I just closed my account on SmugMug and published my images locally from my NAS; this costs no extra money (it's basically free) since the photos were already on the NAS, and the NAS is already powered on 24/7.

The NAS I use is an Asustor, so it's not really Linux and you can't install whatever you want on it, but it has Apache, Python, and PHP with the SQLite extension, which is more than enough for basic websites.

Cloudflare's free tier is like magic. Response times are near-instantaneous and setup is minimal. You don't even have to configure an SSL certificate locally; it's all handled for you, and it works for wildcard subdomains.
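
(For reference: one common way to put Cloudflare in front of a box on your LAN without exposing it directly is a Cloudflare Tunnel, where an outbound-only cloudflared daemon on the NAS maps a public hostname to the local web server. The config below is purely illustrative; the hostname and tunnel ID are placeholders, and a plain proxied DNS record pointing at a port-forwarded NAS works too.)

    # ~/.cloudflared/config.yml -- illustrative only
    tunnel: <TUNNEL-UUID>                      # from `cloudflared tunnel create photos`
    credentials-file: /root/.cloudflared/<TUNNEL-UUID>.json

    ingress:
      - hostname: photos.example.com
        service: http://localhost:80           # the NAS's Apache
      - service: http_status:404               # catch-all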

And of course if one puts a real server behind it, like in the post, anything's possible.

  • Reubend 44 minutes ago

    Is the NAS exposed to the whole internet? Or did you find a clever way to get CloudFlare in front of it despite it just being local?

taosx 2 hours ago

For the people who self-host LLMs at home: what use cases do you have?

Personally, I have some notes and bookmarks that I'd like to scrape, then have an LLM summarize, generate hierarchical tags, and store in a database. For the notes part at least, I wouldn't want to give them to another provider; even for the bookmarks, I wouldn't be comfortable passing my reading profile to anyone.
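
A rough sketch of the pipeline I have in mind, assuming a local Ollama server on its default port with a llama3.1:8b pull (the model name and table schema are just placeholders):

    # Summarize + tag a note with a local Ollama model, store the result in SQLite.
    import json
    import sqlite3
    import urllib.request

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def ask_local_llm(prompt, model="llama3.1:8b"):
        # Ollama's /api/generate returns a JSON object with the text in "response"
        payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(OLLAMA_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    db = sqlite3.connect("notes.db")
    db.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT, summary TEXT, tags TEXT)")

    note = open("some_note.md").read()
    summary = ask_local_llm("Summarize this note in two sentences:\n\n" + note)
    tags = ask_local_llm("Suggest 3-5 hierarchical tags (e.g. dev/python/asyncio), comma-separated:\n\n" + note)
    db.execute("INSERT INTO notes (body, summary, tags) VALUES (?, ?, ?)", (note, summary, tags))
    db.commit()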

  • xyc 2 hours ago

    llama3.2 1B & 3B are really useful for quick tasks like turning some text into a quick script and pasting it to execute; it's super fast and replaces a lot of temporary automation needs. If you don't feel like investing time in automation, sometimes you can just feed the task to an LLM.

    This is one of the reasons why I recently added a floating chat to https://recurse.chat/ for quick access to a local LLM.

    Here's a demo: https://x.com/recursechat/status/1846309980091330815

    • taosx 2 hours ago

      Looks very nice, saved it for later. Last week, I worked on implementing always-on speech-to-text functionality for automating tasks. I've made significant progress, achieving decent accuracy, but I set some self-imposed constraints to implement certain parts from scratch so I can deliver a single-binary deployable solution, which means I still have work to do (audio processing is new territory for me). However, I'm optimistic about its potential.

      That being said, I think the more straightforward approach would be to utilize an existing library like https://github.com/collabora/WhisperLive/ within a Docker container. This way, you can call it via WebSocket and integrate it with my LLM, which could also serve as a nice feature in your product.

      • xyc an hour ago

        Thanks! Lmk when/if you wanna give it a spin; the free trial hasn't been updated with the latest, but I'll try to do that this week.

        I've actually been playing around with speech-to-text recently. Thanks for the pointer; Docker is a bit too heavy to deploy for a desktop app use case, but it's good to know about the repo. Building binaries with PyInstaller could be an option, though.

        Real-time transcription seems a bit complicated since it involves VAD, so a feasible path for me is to first ship simple transcription with whisper.cpp. large-v3-turbo looks fast enough :D
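
        Roughly what I'm picturing for the whisper.cpp path, shelling out to a local build (the paths are placeholders; newer builds name the binary whisper-cli instead of main, and it expects 16 kHz mono WAV input):

            # One-shot transcription via a whisper.cpp build.
            import subprocess

            WHISPER_BIN = "./whisper.cpp/main"                      # or ./build/bin/whisper-cli
            MODEL = "./whisper.cpp/models/ggml-large-v3-turbo.bin"  # downloaded separately

            def transcribe(wav_path):
                # -nt drops timestamps so stdout is just the transcript text
                out = subprocess.run(
                    [WHISPER_BIN, "-m", MODEL, "-f", wav_path, "-nt"],
                    capture_output=True, text=True, check=True,
                )
                return out.stdout.strip()

            print(transcribe("recording.wav"))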

        • taosx an hour ago

          Yes it's fast enough, especially if you don't need something live.

    • afro88 44 minutes ago

      Can you list some real temporary automation needs you've fulfilled? The demo shows asking for facts about space. Lower-parameter models don't seem to be great as raw chat models, so I'm interested in what they're doing well for you in this context.

  • TechDebtDevin an hour ago

    I keep an 8B running with Ollama/OpenWebUI and ask it to format things, summarize text, and generate SQL or simple bash commands and whatnot.
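
    A typical one-off looks something like this, via the ollama Python client (the model name is just whichever 8B I have pulled, and the exact client API may differ slightly between versions):

        # Quick one-off: have a local 8B draft a SQL query.
        import ollama  # pip install ollama; talks to the local Ollama server

        resp = ollama.chat(
            model="llama3.1:8b",
            messages=[{"role": "user", "content":
                       "Write a SQL query that returns the 10 most recent orders per customer."}],
        )
        print(resp["message"]["content"])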

    • worldsayshi an hour ago

      So 8b is really smart enough to write scripts for you? How often does it fail?

      • wokwokwok an hour ago

        > So 8b is really smart enough to write scripts for you?

        Depends on the model, but in general, no.

        ...but it's fine for simple one-liner commands like "how do I revert my commit?" or "rename these files to camelCase".

        > How often does it fail?

        Immediately and constantly if you ask anything hard.

        An 8B model is not ChatGPT. The 3B model in the OP's post is not ChatGPT.

        Compared to Sonnet/4o, the capability gap is like a potato versus a car.

        Search for 'LLM Leaderboard' and you can see for yourself. The 8B models do not even rank. They're generally not capable enough to use as a self-hosted assistant.

  • laniakean an hour ago

    I mostly use it to write quick scripts or generate text that follows some pattern. Also, getting it up and running with LM Studio is pretty straightforward.

  • segalord 2 hours ago

    I use it exclusively to let users on my personal website chat with my data. I've given the setup tools with read access to my files and data.

    • netdevnet 2 hours ago

      Is this not something you can do with non-self-hosted LLMs like ChatGPT? If you expose your data, it should be able to access it, IIRC.

      • worldsayshi an hour ago

        You can absolutely do that, but then you pay by the token instead of paying a big upfront hardware cost. It feels different, I suppose. Sunk cost and all that.

  • ein0p an hour ago

    I run Mistral Large on 2x A6000. Nine times out of ten the response is the same quality as GPT-4o. My employer does not allow the use of GPT for privacy-related reasons, so I just use a private Mistral for that.

netdevnet 2 hours ago

Am I right in thinking that a self-hosted Llama wouldn't have the kind of restrictions ChatGPT has, since it has no initial system prompt?

  • dtquad 2 hours ago

    All the self-hostable LLM and text-to-image models come with some restrictions trained into them [1]. However, plenty of people have made uncensored "forks" of these models where the restrictions have been "trained away" (mostly by fine-tuning).

    You can find plenty of uncensored LLM models here:

    https://ollama.com/library

    [1]: I personally suspect that many LLMs are still trained on WebText, derivatives of WebText, or synthetic data generated by LLMs trained on WebText. This might be why they feel so "censored":

    > WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned

    The implications of so much AI being trained on content upvoted by 2015-2017 redditors are not talked about enough.

    • thrdbndndn 2 minutes ago

      My go-to test for uncensoring is to ask the LLM to write an erotic novel.

      But I haven't yet found any "uncensored" ones (on Ollama) that work. Did I miss something?

      (By contrast: when ChatGPT first came out, it was trivial to jailbreak it into writing erotica.)

  • nubinetwork an hour ago

    That depends on the frontend; you can supply a system prompt if you want to... whether it follows it to the letter is another problem...

  • Kudos 2 hours ago

    Many protections are baked into the models themselves.

  • exe34 2 hours ago

    It has a sanitised output. You might want to look for "abliterated" models, where the general performance might drop a bit but the guard-rails have been diminished.

varun_ch 3 hours ago

I’m curious about how good the performance with local LLMs is on ‘outdated’ hardware like the author’s 2060. I have a desktop with a 2070 Super that could be fun to turn into an “AI server” if I had the time…

  • khafra 2 hours ago

    If you want to set up an AI server for your own use, it's exceedingly easy to install LM Studio and hit the "serve an API" button.
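
    Anything that speaks the OpenAI API can then point at the local server; a minimal sketch, assuming LM Studio's default port 1234 (the model identifier is whatever you've loaded in the UI):

        # Talk to LM Studio's local OpenAI-compatible server.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally
        reply = client.chat.completions.create(
            model="local-model",  # placeholder; LM Studio shows the real identifier
            messages=[{"role": "user", "content": "Summarize this paragraph: ..."}],
        )
        print(reply.choices[0].message.content)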

    Testing performance this way, I got about 0.5-1.5 tokens per second with an 8 GB 4-bit quantized model on an old DL360 rack-mount server with 192 GB RAM and two E5-2670 CPUs. I got about 20-50 tokens per second on my laptop with a mobile RTX 4080.

    • taosx 2 hours ago

      LM Studio is so nice, I was up and running in 5 minutes. Ty!

  • magicalhippo 2 hours ago

    I've been playing with some LLMs like Llama 3 and Gemma on my 2080 Ti. If the model fits in GPU memory, the inference speed is quite decent.

    However, I've found the quality of smaller models to be quite lacking. Llama 3.2 3B, for example, is much worse than Gemma2 9B, which is the one I've found performs best while fitting comfortably.

    Actual sentences are fine, but it doesn't follow prompts as well and it doesn't "understand" the context very well.

    Quantization brings down memory cost, but there seems to be a sharp quality decline below 5 bits for the models I tried. So a larger but heavily quantized model usually performs worse, at least with the models I've tried so far.

    So with only 6GB of GPU memory I think you either have to accept the hit on inference speed by only partially offloading, or accept fairly low model quality.

    Doesn't mean the smaller models can't be useful, but don't expect ChatGPT 4o at home.

    That said, if you've got a beefy CPU, it can be reasonable to have it handle a few of the layers.

    Personally, I found Gemma2 9B quantized to 6-bit (IIRC) quite useful. YMMV.

  • dtquad an hour ago

    I'm using an old laptop with a GTX 1060 (6 GB VRAM) to run a home server with Ubuntu and Ollama. Thanks to quantization, Ollama can run 7B/8B models on an 8-year-old laptop GPU with 6 GB of VRAM.

  • taosx 2 hours ago

    The last time I tried a local LLM was about a year ago, with a 2070S and a 3950X. Performance was quite slow for anything beyond Phi 3.5, and the quality of the small models feels worse than what some providers offer for cheap or free, so it doesn't seem worth it with my current hardware.

    Edit: I loaded a Llama 3.1 8B Instruct GGUF and got 12.61 tok/sec, and 80 tok/sec for 3.2 3B.

  • nubinetwork an hour ago

    I'm happy with a Radeon VII, unless the model is bigger than 16 GB...

seungwoolee518 3 hours ago

Great post!

However, do I need to install the CUDA toolkit on the host?

I haven't had to install the CUDA toolkit when using a containerized platform (like Docker).

ragebol 2 hours ago

Probably saves a bit on the gas bill for heating too

  • rglullis an hour ago

    Snark aside, even in Germany (where electricity is very expensive) it is more economical to self-host than to pay for a subscription to any of the commercial providers.

  • CraigJPerry an hour ago

    I don’t know, it’s kind of amazing how good the lighter-weight self-hosted models are now.

    On a 16 GB system with CPU-only inference, I'm hosting Gemma2 9B at Q8 for LLM tasks and SDXL Turbo for image work, and besides the memory usage creeping up for a second or so while I invoke a prompt, they're basically undetectable in the background.

  • szundi an hour ago

    If only we had heat-pump computers

    • ragebol 40 minutes ago

      I'd gladly run whatever model you want at home and rent it out, so you'd pay for the heating, the GPU, and the power consumed :-)

satvikpendem 2 hours ago

I love Coolify; I used to use v3. Anyone know how their v4 is going? I thought it was still a beta release, from what I saw on GitHub.

  • j12a 2 hours ago

    Coolify is quite nice; I've been running some things with the v4 beta.

    It reminds me a bit of making websites with a page builder: easy to install and click around, and you get something running fairly quickly without thinking too much about it.

    The problems are quite similar too: the training wheels get stuck in the woods more easily, hehe.

  • whitefables 2 hours ago

    I'm using the v4 beta in the blog post. I didn't try v3, so I have no point of comparison, but I'm loving it so far!

    It was so easy to get other non-AI stuff running, too!

_blk 2 hours ago

Why disable LVM for a smoother reboot experience? For encryption I get it, since you need a key to mount, but all my setups have LVM or ZFS and I'd say my reboots are smooth enough.