My apologies for the past day or so of downtime.
I had a work conference all of last week, and on the last morning around 4am, before I headed back to my timezone, "something" inside of my Kubernetes cluster took a dump.
While I can remotely reboot the nodes and even access them, the scope of what went wrong was far beyond what I could fix remotely from my phone.
After returning home yesterday evening, I started plugging away and quickly realized something was seriously wrong with the cluster. From previous experience, I knew it would be quicker to just tear it down, rebuild it, and restore from backups. So I started that process.
However, since I had not seen my wife in a week, spending some time with her felt slightly more important at the time. That said, I was able to finish getting everything restored today.
Due to these issues, I will be rebuilding some areas of my infrastructure to be slightly more redundant.
Whereas before I had bare-metal machines running Ubuntu, going forward I will be leveraging Proxmox for compute clustering and HA, along with Ceph for storage HA.
That being said, sometime soon I will have Ansible playbooks set up to get everything pushed out and running.
Again, my apologies for the downtime. It was completely unexpected and came out of the blue. I honestly still have no idea what happened.
The best suspicion I have is a disk failure... and yet, after rebooting the machine, it came back to life?
Regardless, I will work to improve this moving forward. Also, I don't plan on being out of town again soon, so that will help too.
There may be some slight downtime later on as I work on and move things around. If that is the case, it will be short. For now, the goal is just restoring my other services and getting back up and running.
Update 2023-07-23 CST
There are still a few kinks being worked out; I have noticed things are occasionally still disconnecting.
I am still ironing out the issues. Please bear with me.
(This issue appears to be due to a single Realtek NIC in the cluster... Realtek = bad.)
Update 9:30pm CST
Well, it has been a "fun" evening. I have been finding issues left and right.
- A piece of bad fiber cable.
- The aforementioned server with a Realtek NIC, which was bringing down the entire cluster.
- STP/RSTP issues, likely caused by the above two.
Still working and improving...
Update 9am CST
Working out a few minor kinks still. Finish line is in sight.
Update 5pm CST
Happened to find an SFP+ module that was in the process of dying. Swapped it out with a new one, and magically, many of the spotty network issues went away.
Have new fiber ordered, will install later this week.
Update 9pm CST
- Broken/Intermittent SFP+ Module replaced.
- Server with the crappy Realtek NIC removed. Re-added the server with 10G SFP+ connectivity.
- Clustered servers moved to dedicated switch.
- New fiber stuff ordered to replace longer-distance (50ft) 10G copper runs.
I am aware of the current performance issues. These will start going away as I expand the cluster. For now, I am still focusing on rebuilding everything to a working state.
Lemmyonline.com is back under a new owner! In the coming days I'll be migrating the servers, etc., but other than that, everything should stay the same, so feel free to enjoy the instance as you used to!
As promised, if I brought the instance offline, I would give you a heads up in advance.
Here are the reasons behind my coming to this decision:
Moderation / Administration
Lemmy has absolutely ZERO administration tools, other than the ability to create a report. This makes it extremely difficult to properly administer anything.
As well, other than manually running reports and queries against the local database, I literally have no insight into anything. I can't even see a list of which users are registered on this instance without running a query on the database.
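To give a sense of how manual this is, here is the sort of query involved just to list local users. This is only a sketch: I'm using an in-memory SQLite table as a stand-in for Lemmy's actual Postgres database, and the `person` table with its `local` flag is my assumption about the schema, so treat the names as illustrative.

```python
# Minimal stand-in for "list the users registered on this instance".
# SQLite here simulates Lemmy's real Postgres database; the table and
# column names (person.name, person.local) are assumptions about the
# schema and may differ in your Lemmy version.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, local INTEGER)")
conn.executemany(
    "INSERT INTO person (name, local) VALUES (?, ?)",
    [("alice", 1), ("bob", 0), ("carol", 1)],  # bob is a federated (remote) user
)

# The actual query: users registered on *this* instance only.
local_users = [row[0] for row in conn.execute(
    "SELECT name FROM person WHERE local = 1 ORDER BY name"
)]
print(local_users)  # ['alice', 'carol']
```

That is the entire "admin panel" for the question "who has an account here" today.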
I host lemmyonline.com on some of my personal infrastructure. It shares servers, storage, etc. It is powered via my home solar setup, and actually doesn't cost much to keep online.
However, for a project which compensates me exactly $0.00 USD (no, I still don't take donations), it is NOT worth the additional liability I am taking on.
That liability being: trolls/attackers are currently uploading CSAM to Lemmy, and the thumbnails and content get synced to this instance. At that point, I am on the hook for that content. This also circles back to the problem of having basically no moderation capabilities.
Once something is posted, it is sent everywhere.
Here in the US, they like to send out no-knock raids. That is no bueno.
One issue I have noticed: every single image/thumbnail appears to get cached by pictrs. This data is never cleaned up and never purged, so it will just keep growing and growing. The growth isn't drastic, around 10-30 GB of new data per week, but it isn't going to be sustainable, especially since, again, this project compensates me nothing. While hosting 100 GB of content isn't going to be a problem, when we start looking at 1 TB, 10 TB, etc., that costs money.
It's not as simple as tossing another disk into my cluster. The storage needs redundancy, so you need multiple disks there.
Then you need backups. A few more disks here.
Then we need offsite backups. Those cost $/TB stored.
I don't mind putting some resources up front to host something that takes a nominal amount of resources. However, based on my stats, it's going to continue growing forever, as there is no purge/timeout/lifespan attached to these objects.
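To put numbers on it, here is a quick back-of-the-envelope projection using the 10-30 GB/week figure above. The $5/TB/month offsite rate is a made-up placeholder for illustration, not a real quote.

```python
# Back-of-the-envelope pictrs growth projection, using the observed
# ~10-30 GB/week figure. The offsite backup rate below is a
# hypothetical placeholder, not an actual price.
LOW_GB, HIGH_GB = 10, 30          # observed weekly growth range
OFFSITE_USD_PER_TB_MONTH = 5.0    # assumed offsite backup rate

for weeks in (26, 52, 104):
    low_tb = LOW_GB * weeks / 1000
    high_tb = HIGH_GB * weeks / 1000
    monthly = high_tb * OFFSITE_USD_PER_TB_MONTH
    print(f"{weeks:>3} weeks: {low_tb:.2f}-{high_tb:.2f} TB cached "
          f"(~${monthly:.2f}/mo offsite, worst case)")
```

At the high end, that is over 1.5 TB cached after a single year, before redundancy and backup copies multiply it.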
I don't enjoy lemmy enough to want to put up with the above headaches.
Let's face it: you have already seen me complain about the general negativity around Lemmy.
The quality of content here just isn't the same. I have posted lots of interesting content to try to get collaboration going, but it just doesn't happen.
I just don't see nearly as much interesting content as I want to interact with.
I get no benefit from hosting lemmy online. It was a fun side project for a while. I refuse to attempt to monetize it as well.
As such, since I don't enjoy it, and the process of keeping on top of the latest attacks each week is time-consuming and tiresome, the plan is simple.
The servers will go offline 2023-09-04.
If you wish to migrate your account to another instance:
Here is a tool that was recently released.
A heads up....
Since attackers are now uploading CSAM (child porn) to Lemmy, which gets federated to other instances...
Because I really don't want to give the feds any reason to come knocking on my door, pictrs is now disabled as of this time.
This means that if you try to post an image, it will fail. You may notice other issues as well.
Driver for this: https://lemmyonline.com/post/454050
This is a hobby for me. Given the complete and utter lack of moderation tools to help me properly filter content, the nuclear approach is the only approach here.
I am just wondering... is it just me, or is there a LOT of general negativity here?
Every other post I see is...
- America is bad.
- Capitalism is bad. Socialism/Communism is good.
- If you don't like communism, you are a fascist nazi.
Honestly, it's kind of killing my mood with Lemmy. There are a few decent communities/subs here, but the quality of content appears to be falling.
I mean, FFS, it can't just be me noticing this. It honestly feels like I am supporting a communist platform here.
I am on social media to post and read about things related to technology, automation, race cars, etc.
Every other technology post is somebody bashing Elon Musk (actually, that is deserved), or talking about Reddit (let it go. Seriously. We are here; it is there).
As for my hobby of race cars: I guess half of the people on Lemmy feel it is OK to vandalize a car for being too big... car-hate culture is pretty big here.
All of this is really souring my mood regarding Lemmy.
Sorry for the ~30 seconds of downtime earlier, however, we are now updated to version 0.18.4.
Official patch notes: https://join-lemmy.org/news/2023-08-08_-_Lemmy_Release_v0.18.4
Base Lemmy Changes:
- Fix fetch instance software version from nodeinfo (#3772)
- Correct logic to meet join-lemmy requirement, don’t have closed signups. Allows Open and Applications. (#3761)
- Fix ordering when doing a comment_parent type list_comments (#3823)
Lemmy UI Changes:
- Mark post as read when clicking “Expand here” on the preview image on the post listing page (#1600) (#1978)
- Update translation submodule (#2023)
- Fix comment insertion from context views. Fixes #2030 (#2031)
- Fix password autocomplete (#2033)
- Fix suggested title " " spaces (#2037)
- Expanded the RegEx to check if the title contains new line characters. Should fix issue #1962 (#1965)
- ES-Lint tweak (#2001)
- Upgrading deps, running prettier. (#1987)
- Fix document title of admin settings being overwritten by tagline and emoji forms (#2003)
- Use proper modifier key in markdown text input on macOS (#1995)
In addition to updating Lemmy just now:
The storage issues have been resolved, and the hosting issues have been resolved...
Things should return to being stable and reliable now.
Turns out... it's the Ceph storage.
Despite having 7x OSDs on bare-metal NVMe, and despite having DEDICATED 10G network connectivity, it's having significant performance issues.
Any spike in IO (large file transfers, backups, even copying files to a different server) would cause huge IO delays, causing things to break or drop offline.
There are no errors shown, and the configuration is pretty standard. I have no idea why it is having so many issues.
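For the curious, stalls like this are easy to surface with a crude fsync-latency probe. This is just a sketch of the idea (the 4 MiB chunk size and 8 rounds are arbitrary choices, and it is not the exact diagnostic I ran against Ceph): on a healthy NVMe-backed volume each fsync should finish in milliseconds, while a stalling backend makes these calls hang for seconds.

```python
# Crude fsync-latency probe: write a chunk, fsync, and time it.
# Pointing `path` at a file on the suspect volume exposes IO stalls
# as outliers in the per-round latencies.
import os
import tempfile
import time

CHUNK = b"\0" * (4 * 1024 * 1024)  # 4 MiB per write (arbitrary)

def probe(path: str, rounds: int = 8) -> list:
    """Return per-round write+fsync latencies in milliseconds."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for _ in range(rounds):
            start = time.perf_counter()
            os.write(fd, CHUNK)
            os.fsync(fd)  # force the data to stable storage
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies

with tempfile.TemporaryDirectory() as d:
    results = probe(os.path.join(d, "probe.bin"))
    print(f"max write+fsync latency: {max(results):.1f} ms")
```

Run it during a backup or a large file copy and the worst-case latency tells you whether the storage layer is the thing falling over.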
I have cleared off a new NVMe and will move this server to it tomorrow, hopefully ending all of the issues from this week... assuming I have any users left here. (I wouldn't blame you for leaving; it has been a really bad week for LemmyOnline.)
If my assumptions are incorrect, then f-it, I will just run Lemmy on a bare-metal server I have on standby.
Server migrated to local storage. It was nearly unnoticeable, unless you did something during the 3-minute window it took to clone/restore/etc.
Just finished migrating to a different server... hopefully this helps some.
As a continuation of the FIRST POST
As you have likely noticed, there are still issues.
To summarize the first post: a catastrophic software/hardware failure meant needing to restore from backups.
I decided to take the opportunity to rebuild newer and better. As such, I decided to give Proxmox a try, with a Ceph storage backend.
After getting a simple k8s environment back up and running on the cluster and restoring the backups, LemmyOnline was mostly back in business using the existing manifests.
Well, the problem is... when heavy backend IO occurs (during backups, big operations, installing large software...), the longhorn.io storage used in the k8s environment kind of... "dies".
And as I have seen today, this is not an infrequent issue. I have had to bounce the VM multiple times today to restore operations.
I am currently working on building out a new VM specifically for LemmyOnline, to separate it from the temporary k8s environment. Once this is up and running, things should return to stable and normal.
We were poking around the list of Lemmy servers, and we saw this crazy anomaly: 1.2 million comments in the course of a day or two... what's going on over there?
Just an alternative front-end... or, if you prefer, a different front-end.
Apologies for the short outage today, My core switch decided to reboot for unknown reasons.
This thing has been up for close to a year, with no recent changes, or explanations behind the reboot.
Just updated to 0.18.1 rc9 for both the UI and the backend.
There have been a lot of bugs fixed, and hopefully this should improve the performance quite a bit too.
There seems to be some kind of bug with the Subscribe and Block buttons in the sidebar. They're appearing for me as nothing but text. Here's a picture with the inspector open showing that there's no element at all.
Also, this only occurs from the community view. Going into a post displays the buttons correctly.
I don't know if this is a Lemmy issue or an instance issue; all I know is that I don't believe it was an issue on another instance I was signed up to.
I hear lots of issues are fixed, and federation will work better!
So I am going to attempt to update everything to 0.18.
If the update fails, there are backups in place, and I will revert the changes.
Edit: The giveaway is COMPLETE. Rewards have been given out.
Giveaway #1 was completed in !firstname.lastname@example.org
Giveaway #2, will occur here in !email@example.com
- At the end of the giveaway, the top highest-voted comments will receive a PM containing a Humble Bundle redemption link for the game of their choice.
- The rankings will be handled in the order the comments are listed on the lemmyonline.com instance.
- You are allowed to request multiple games. If you are chosen, the first game you have listed that is still available will be granted.
- If there are multiple comments for the same game, only the winning/highest-voted comment will be considered. (So it's not a bad idea to request multiple games.)
- If you do want to participate but don't have anything specific in mind, make sure to comment something along the lines of "I'll take anything". Otherwise, you will not be included.
Note- only top-level comments are considered. Replies to other comments are not considered.
This contest will end MONDAY 6/26, around 11am to noon CST.
(If I am delayed in handing out prizes to the winners, it will happen as soon as possible.)
(Anything listed as claimed is unavailable.)
If you win
- I will reply to your winning comment, letting you (and others) know who won the particular game.
- You will receive a PM/DM via Lemmy containing a Humble Bundle link to redeem the game. (If it does not work for you, we can work together to find another way to redeem the codes.)
A few tips:
- Make sure to click "All" (instead of only "Local"). You can change this in your profile settings.
- Here is a short guide for finding and subscribing to communities.
I will update this post as new issues arise. If you have questions or issues, post below, or ask me on Discord.
A pretty bad storm just knocked out the power. Gonna be on backup power for a while.
We should have enough capacity to go for 6 hours, but we will have to see.
Instance will be down for around 30 minutes today.
Installing new fiber NICs into the server, and adding a few Coral TPUs for good measure.
What is this?
This is a hopeful replacement for Reddit.
Why are we replacing Reddit?
Because Reddit has proven it does not care about its users, and it does not care about the concerns of its moderators. A massive protest resulting in over 75% of Reddit shutting down for a day was just "noise", according to u/spez.
Just noise. It will pass in a day.
I don't know about you, but I don't like being referred to as "just noise".
So, I think it's time for a revolution.
If you have not done so already, it is time to join Lemmy. You can either join lemmyonline.com, or you can look for another server to join here: https://join-lemmy.org/
This entire platform is decentralized and federated. As such, you can comment or view posts on any of the servers.