{"id":293,"date":"2025-09-24T04:38:59","date_gmt":"2025-09-24T04:38:59","guid":{"rendered":"https:\/\/myallcodes.in\/?p=293"},"modified":"2025-09-24T04:38:59","modified_gmt":"2025-09-24T04:38:59","slug":"self-healing-infrastructure-with-prometheus-alertmanager-ansible","status":"publish","type":"post","link":"https:\/\/myallcodes.in\/index.php\/2025\/09\/24\/self-healing-infrastructure-with-prometheus-alertmanager-ansible\/","title":{"rendered":"Self-Healing Infrastructure with Prometheus, Alertmanager &amp; Ansible"},"content":{"rendered":"\n<p>This project demonstrates a self-healing infrastructure setup using Prometheus, Blackbox Exporter, Alertmanager, a custom monitoring service, and Ansible. It continuously monitors a service container, triggers alerts when the service goes down, and automatically restarts the service via Ansible playbooks.<\/p>\n\n\n\n<p>\u2699\ufe0f Architecture<\/p>\n\n\n\n<p>Blackbox Exporter \u2192 Prometheus \u2192 Alertmanager \u2192 Monitor (amtool + jq) \u2192 Ansible Runner \u2192 Restart Service<\/p>\n\n\n\n<p>Blackbox Exporter \u2192 Probes endpoints (HTTP check for service availability).<\/p>\n\n\n\n<p>Prometheus \u2192 Collects metrics &amp; applies alert rules.<\/p>\n\n\n\n<p>Alertmanager \u2192 Manages alerts and exposes active alerts to the monitor container.<\/p>\n\n\n\n<p>Monitor container \u2192 Queries Alertmanager with amtool, parses alerts with jq, and triggers recovery.<\/p>\n\n\n\n<p>Ansible Runner \u2192 Executes a playbook to restart the failed service.<\/p>\n\n\n\n<p>Service \u2192 A sample Nginx-based container being monitored.<\/p>\n\n\n\n<p>\ud83d\udcc2 Project Structure<\/p>\n\n\n\n<p>self-healing-infra\/<\/p>\n\n\n\n<p>\u2502\u2500\u2500 alertmanager\/<\/p>\n\n\n\n<p>\u2502 \u2514\u2500\u2500 config.yml # Alertmanager config<\/p>\n\n\n\n<p>\u2502\u2500\u2500 ansible\/<\/p>\n\n\n\n<p>\u2502 \u2514\u2500\u2500 playbook.yml # Ansible playbook to restart service<\/p>\n\n\n\n<p>\u2502\u2500\u2500 monitor\/<\/p>\n\n\n\n<p>\u2502 \u2514\u2500\u2500 
Dockerfile # Dockerfile for monitor container<\/p>\n\n\n\n<p>\u2502\u2500\u2500 scripts\/<\/p>\n\n\n\n<p>\u2502 \u2514\u2500\u2500 monitor_alerts.sh # Monitor script<\/p>\n\n\n\n<p>\u2502\u2500\u2500 prometheus\/<\/p>\n\n\n\n<p>\u2502 \u2514\u2500\u2500 prometheus.yml # Prometheus config &amp; alert rules<\/p>\n\n\n\n<p>\u2502\u2500\u2500 docker-compose.yml # Multi-service orchestration<\/p>\n\n\n\n<p>\u2502\u2500\u2500 README.md # Project documentation<\/p>\n\n\n\n<p>\ud83d\udda5\ufe0f Local Prerequisites<\/p>\n\n\n\n<p>Before running this project, ensure you have the following installed on your local machine:<\/p>\n\n\n\n<p>Docker Desktop (latest version, with WSL2 backend enabled if on Windows)<\/p>\n\n\n\n<p>Git<\/p>\n\n\n\n<p>(Optional) Visual Studio Code for editing configs<\/p>\n\n\n\n<p>\ud83d\ude80 Getting Started<\/p>\n\n\n\n<p>1\ufe0f\u20e3 Clone Repository<\/p>\n\n\n\n<p>git clone&nbsp;<a href=\"https:\/\/github.com\/your-username\/self-healing-infra.git\">https:\/\/github.com\/your-username\/self-healing-infra.git<\/a><\/p>\n\n\n\n<p>cd self-healing-infra<\/p>\n\n\n\n<p>2\ufe0f\u20e3 Build &amp; Run<\/p>\n\n\n\n<p>docker compose up -d --build<\/p>\n\n\n\n<p>3\ufe0f\u20e3 Verify Setup<\/p>\n\n\n\n<p>Blackbox Exporter \u2192&nbsp;<a href=\"http:\/\/localhost:9115\/\">http:\/\/localhost:9115<\/a><\/p>\n\n\n\n<p>Prometheus \u2192&nbsp;<a href=\"http:\/\/localhost:9090\/\">http:\/\/localhost:9090<\/a><\/p>\n\n\n\n<p>Alertmanager \u2192&nbsp;<a href=\"http:\/\/localhost:9093\/\">http:\/\/localhost:9093<\/a><\/p>\n\n\n\n<p>Service (Nginx) \u2192&nbsp;<a href=\"http:\/\/localhost:8082\/\">http:\/\/localhost:8082<\/a><\/p>\n\n\n\n<p>4\ufe0f\u20e3 Test Self-Healing<\/p>\n\n\n\n<p>Stop the service manually:<\/p>\n\n\n\n<p>docker stop service<\/p>\n\n\n\n<p>Within ~30 seconds, Prometheus detects the failed probe and sends an alert to Alertmanager, and the monitor triggers Ansible to restart the service automatically.<\/p>\n\n\n\n<p>Check status:<\/p>\n\n\n\n<p>docker ps<\/p>\n\n\n\n<p>\ud83d\udcca Monitoring Flow<\/p>\n\n\n\n<p>Blackbox Exporter 
probe fails \u2192 Prometheus alert rule triggers.<\/p>\n\n\n\n<p>Prometheus sends the alert to Alertmanager.<\/p>\n\n\n\n<p>Alertmanager exposes active alerts.<\/p>\n\n\n\n<p>Monitor script (inside the monitor container) checks alerts using amtool.<\/p>\n\n\n\n<p>On a ServiceDown alert, Ansible Runner executes the playbook.<\/p>\n\n\n\n<p>Service restarts automatically \ud83c\udf89.<\/p>\n\n\n\n<p>\ud83d\udee1\ufe0f Key Features<\/p>\n\n\n\n<p>\u2705 Fully automated self-healing workflow.<\/p>\n\n\n\n<p>\u2705 Works with Docker &amp; Ansible inside containers.<\/p>\n\n\n\n<p>\u2705 Uses Prometheus + Alertmanager for monitoring and alerting.<\/p>\n\n\n\n<p>\u2705 Modular design: you can extend it to restart any container or service.<\/p>\n\n\n\n<p>\u2705 Cross-platform (tested on Windows + WSL2).<\/p>\n\n\n\n<p>\ud83d\udd2e Future Enhancements<\/p>\n\n\n\n<p>Integrate with Grafana for dashboards.<\/p>\n\n\n\n<p>Support multi-service healing.<\/p>\n\n\n\n<p>Use Kubernetes + Operators for scaling.<\/p>\n\n\n\n<p>Extend Ansible playbooks for VM\/Cloud service recovery.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"844\" height=\"581\" src=\"https:\/\/myallcodes.in\/wp-content\/uploads\/2025\/09\/image-1.png\" alt=\"\" class=\"wp-image-294\" srcset=\"https:\/\/myallcodes.in\/wp-content\/uploads\/2025\/09\/image-1.png 844w, https:\/\/myallcodes.in\/wp-content\/uploads\/2025\/09\/image-1-300x207.png 300w, https:\/\/myallcodes.in\/wp-content\/uploads\/2025\/09\/image-1-768x529.png 768w, https:\/\/myallcodes.in\/wp-content\/uploads\/2025\/09\/image-1-660x454.png 660w\" sizes=\"auto, (max-width: 844px) 100vw, 844px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>This project demonstrates a self-healing infrastructure setup using Prometheus, Blackbox Exporter, Alertmanager, a custom monitoring service, and Ansible. It continuously monitors a service container, triggers alerts when the service goes down, and automatically restarts the service via Ansible playbooks. 
\u2699\ufe0f Architecture Blackbox Exporter \u2192 Prometheus \u2192 Alertmanager \u2192 Monitor (amtool + jq) \u2192 Ansible Runner\u2026 <span class=\"read-more\"><a href=\"https:\/\/myallcodes.in\/index.php\/2025\/09\/24\/self-healing-infrastructure-with-prometheus-alertmanager-ansible\/\">Read More &raquo;<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[25],"tags":[],"class_list":["post-293","post","type-post","status-publish","format-standard","hentry","category-miscellaneous"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/posts\/293","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/comments?post=293"}],"version-history":[{"count":1,"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/posts\/293\/revisions"}],"predecessor-version":[{"id":295,"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/posts\/293\/revisions\/295"}],"wp:attachment":[{"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/media?parent=293"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/categories?post=293"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/myallcodes.in\/index.php\/wp-json\/wp\/v2\/tags?post=293"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}