
#prometheus


Quick promql tip: promql _hates_ duplicate data in joins and will throw an absolute fit if you ever have them.

Unfortunately sometimes you end up with overlapping data points, because, e.g., kube-state-metrics is rolling out and there are temporarily two copies running, or because Prometheus also keeps stale data for up to five minutes after a datapoint is no longer seen.

The way I've worked around this in the past is to throw my query inside `max(metric) by(labels I care about)`, but I also learned/realized today that you could instead do `topk(1, metric)`, which is like `max`, except that it keeps all the labels on that datapoint.

Obviously you should think carefully about whether this is correct for your use case. If your metric is binary (only ever 0 or 1), the max is fine. If you use topk and depend on the output labels, be aware that it's just going to pick one series arbitrarily.

It would be best if you are able to clean up/select out the duplicate metrics, but sometimes that's.... not possible or more annoying than it's worth.
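
A minimal sketch of both workarounds, using kube-state-metrics' `kube_pod_info` purely as an illustration:

```promql
# Drop the duplicates by keeping only the labels you aggregate on:
max by (namespace, pod) (kube_pod_info)

# Or keep one whole sample per group with its original labels intact
# (topk/bottomk return input samples as-is; "by" only controls grouping):
topk by (namespace, pod) (1, kube_pod_info)

# Note: a bare topk(1, kube_pod_info) picks a single sample across the
# *entire* metric, which is only what you want if the metric has one
# logical series.
```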

Where do I put `and sum(pve_lock_state{state="backup"}) == 0` in this thing? None of the obvious spots return anything

`(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}`
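
One likely culprit (an assumption, not confirmed from the data): `and` matches series by label set, and `sum(...)` strips all labels, so nothing on the left ever matches. A sketch that wraps the whole expression and matches on nothing:

```promql
(
  (
    rate(node_disk_read_time_seconds_total[1m])
      / rate(node_disk_reads_completed_total[1m]) > 0.1
    and rate(node_disk_reads_completed_total[1m]) > 0
  )
  * on(instance) group_left(nodename) node_uname_info{nodename=~".+"}
)
and on() (sum(pve_lock_state{state="backup"}) == 0)

# Caveat: if pve_lock_state{state="backup"} can be absent entirely,
# sum() over nothing returns nothing and the whole expression goes empty;
# in that case `unless on() (sum(pve_lock_state{state="backup"}) > 0)`
# is the safer form.
```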

@dumpsterqueer hey there, I've been running #gotosocial for a little while now and want to know more about what it's doing behind the scenes (database size, storage, federation, etc.)

I've set up a #Prometheus container and successfully connected it to my #GTS server, but I'm completely new to Prometheus and don't know where to go from here.

Are there any example queries online I can refer to, or a schema for the database populated by the GTS metrics scraping?
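
A hedged starting point: Prometheus has no relational schema to browse; it stores named time series that you query with PromQL. The job name `gotosocial` below is an assumption based on a typical scrape config, and the queries use only Prometheus's auto-generated per-target metrics, so they work before you learn any GTS-specific metric names:

```promql
# Is the scrape working at all? (1 = up, 0 = down)
up{job="gotosocial"}

# How many samples does the target expose per scrape, and how long does
# a scrape take?
scrape_samples_scraped{job="gotosocial"}
scrape_duration_seconds{job="gotosocial"}

# List every metric name the target exposes (run in the expression browser):
count by (__name__) ({job="gotosocial"})
```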

Questions for #fediverse #SysAdmin folk who run #prometheus and #grafana on multiple #aws accounts:

Do you use a single Prometheus server or do you have one for each account?

How do you handle auto-scaling EC2 servers whose IPs could change at any time?

Is it possible for servers to push rather than get pulled from?

Please share any setups that could be relevant and boost far & wide!

Thanks in advance!! ❤️❤️❤️
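
Not a definitive answer, but the changing-IP part is usually handled with Prometheus's built-in EC2 service discovery, which re-resolves targets from the AWS API instead of static addresses. A rough sketch; the region, port, and tag filter are placeholders:

```yaml
scrape_configs:
  - job_name: "ec2-nodes"
    ec2_sd_configs:
      - region: eu-west-1        # placeholder region
        port: 9100               # assumes node_exporter on each instance
        filters:
          - name: "tag:Monitoring"   # only discover instances tagged for scraping
            values: ["enabled"]
    relabel_configs:
      # Turn the EC2 Name tag into a stable label, since the IP-based
      # address changes whenever autoscaling replaces a host.
      - source_labels: [__meta_ec2_tag_Name]
        target_label: node_name
```

For the push question, a common pattern is a small per-account Prometheus (or agent-mode instance) that remote_writes to a central server, rather than the EC2 instances themselves pushing.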

The #Aqara temperature sensor on my balcony is flaky. Every couple of weeks it just disappears for no apparent reason. So I spent my morning figuring out how to enable #Prometheus metrics in #HomeAssistant and wrote an alerting rule to get a notification on my phone via my local ntfy instance.

Yay, computers. 🙈
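
For anyone wanting to replicate this, a rough sketch of such an alerting rule. The metric and entity names are assumptions; check what your Home Assistant Prometheus endpoint actually exposes:

```yaml
groups:
  - name: balcony-sensor
    rules:
      - alert: BalconyTemperatureSensorGone
        # Metric and entity names below are assumptions, not taken from
        # the original post.
        expr: absent(homeassistant_sensor_temperature_celsius{entity="sensor.balcony_temperature"})
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Balcony temperature sensor has stopped reporting"
```

The notification leg can then be handled by Alertmanager forwarding to the local ntfy instance via a webhook receiver (or a small bridge).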

Is your company looking for a keen self-hoster with plenty of #Linux experience? I grew up with #RaspberryPi and have picked up many skills along the way including #React, backend JavaScript (#NodeJS) and #Docker. My current obsession is monitoring all the things with #Grafana, #PRTG and #Prometheus. I’m based in the UK but open to primarily English-speaking roles in Germany, too. Currently wrapping up my Advanced Software Development degree but eager to continue learning! Boosts appreciated :D

In Greek mythology, when the gods were deciding which parts of an animal would be the divine sacrifice, Prometheus tricked Zeus into choosing the bones wrapped in fat rather than the meat. Thus, humans were able to eat the best parts of the meat themselves instead of offering them up. This enraged Zeus.
🎨 Heinrich Friedrich Füger

I've been disappointed about this for at least the last decade, but if you feel that the polling-based designs of Kubernetes and Prometheus are "wrong", here's some science:
arxiv.org/abs/2507.02158

arXiv.org: Signalling Health for Improved Kubernetes Microservice Availability
Microservices are often deployed and managed by a container orchestrator that can detect and fix failures to maintain the service availability critical in many applications. In Poll-based Container Monitoring (PCM), the orchestrator periodically checks container health. While a common approach, PCM requires careful tuning, may degrade service availability, and can be slow to detect container health changes. An alternative is Signal-based Container Monitoring (SCM), where the container signals the orchestrator when its status changes. We present the design, implementation, and evaluation of an SCM approach for Kubernetes and empirically show that it has benefits over PCM, as predicted by a new mathematical model. We compare the service availability of SCM and PCM over six experiments using the SockShop benchmark. SCM does not require that polling intervals are tuned, and yet detects container failure 86% faster than PCM and container readiness in a comparable time with limited resource overheads. We find PCM can erroneously detect failures, and this reduces service availability by 4%. We propose that orchestrators offer SCM features for faster failure detection than PCM without erroneous detections or careful tuning.