What a $5 VPS honeypot taught me

What I built and why

I spun up a honeypot on a cheap VPS, pointed it at the open internet, and started logging every unsolicited probe that hit it. After a few weeks I had a few hundred thousand requests: scanner fingerprints, exploit attempts, credential stuffing, the usual background radiation of the internet. I cleaned it up, stuck it behind a public API, and exposed it three ways: a plain HTTP endpoint you can hit with curl, a JSON schema for programmatic clients, and an MCP server so language models can query it directly during analysis sessions.

The pitch is simple. If you operate a small network, you don’t have a threat intel team. You don’t have a SIEM correlating across thousands of customers. What you have is a firewall log full of IPs you can’t categorize. A public probe database lets you check, in one curl call, whether an IP that hit your edge has been hitting other people’s edges with known exploit patterns. That’s not classified intelligence. That’s data anyone with a $5 VPS can collect. The question is whether collecting it in the open helps defenders more than it helps attackers. I’ll argue it does, but the argument has caveats worth naming up front.

The defender case is real and underrated

Most commercial threat feeds are expensive, opaque, and slow. You pay five figures a year, you get a list of IPs and hashes, and you have no idea how the vendor decided an IP was malicious. When the feed is wrong - and they’re often wrong - your only recourse is a support ticket.

An open probe database flips that. Every record has provenance. You can see the exact timestamp the IP hit the sensor, the exact payload it sent, the User-Agent it claimed, and the TCP fingerprint of the connection. If you think the record is wrong, you can fetch the raw evidence and decide for yourself. That kind of transparency matters when you’re deciding whether to block traffic from a /24 that might include a legitimate customer.
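A minimal sketch of what a record with that provenance could look like in Python. The class name, the JSON key names, and the sample values are hypothetical, chosen only to mirror the four pieces of evidence listed above; they are not the project's actual schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProbeRecord:
    """One sensor observation, carrying the evidence a consumer can audit."""
    source_ip: str
    seen_at: datetime      # when the probe hit the sensor, normalized to UTC
    payload: str           # the request as the sensor received it
    user_agent: str        # the User-Agent the client claimed
    tcp_fingerprint: str   # opaque fingerprint string for the connection


def parse_record(raw: str) -> ProbeRecord:
    """Parse one JSON record. Key names are assumptions for illustration."""
    doc = json.loads(raw)
    seen_at = datetime.fromisoformat(doc["seen_at"])
    if seen_at.tzinfo is None:
        # Assumption: timestamps without an offset are already UTC.
        seen_at = seen_at.replace(tzinfo=timezone.utc)
    return ProbeRecord(
        source_ip=doc["source_ip"],
        seen_at=seen_at.astimezone(timezone.utc),
        payload=doc["payload"],
        user_agent=doc["user_agent"],
        tcp_fingerprint=doc["tcp_fingerprint"],
    )
```

Keeping the raw payload and timestamp on the record, rather than only a verdict, is what lets a consumer re-check a disputed entry instead of taking the flag on faith.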

The second underrated benefit is speed of action. With my setup, a probe hitting the sensor at 02:14 UTC shows up in the public API by 02:15 UTC. Compare that to the typical commercial feed cycle, which batches updates every few hours. For mass-scanning campaigns - and almost all internet-wide exploitation now starts as a mass scan - those minutes matter. The scanner hits a few hundred sensors before it hits you. If you’re querying the database in your edge logic, you can block the IP before it reaches your application, on the basis of behavior it exhibited against someone else’s sensor five minutes ago.

Third, there’s the small operator problem. A solo developer running a Mastodon instance or a personal git server is exposed to the same internet-wide scanning as a Fortune 500, minus the budget. Giving that operator a free curl-able endpoint that returns “this IP has hit 412 sensors in the last 24 hours with CVE-2024-XXXX payloads” levels the playing field in a way no commercial product can, because the commercial product won’t sell to them at a price they’ll pay.

The MCP angle changes the workflow

The HTTP endpoint is the obvious interface. The MCP server is the one that changed how I actually use the data. Model Context Protocol lets a language model call the database as a tool during a conversation. So when I’m investigating an alert - say, a weird POST to /wp-login.php from an IP I’ve never seen - I can ask the model to enrich it, and the model fetches the probe history, the geolocation, the ASN, and the payload similarity to known campaigns, all in one shot.

That sounds incremental. It isn’t. The bottleneck in small-shop security work is not detection. It’s triage. You get 200 alerts a day, 198 are nothing, two are real, and you have no time to investigate any of them properly. An MCP-enriched workflow turns a five-minute manual lookup into a five-second tool call, and the model writes a paragraph of context you can paste into a ticket. The time saved is probably four to six hours per analyst per week. That’s a junior analyst’s entire Friday afternoon back.

The risk is that the model gets confident and wrong. I’ll come back to this.

The risks are also real and need naming

A public probe database is a free reconnaissance tool for attackers. If I’m running a botnet, I can query the database to see which of my nodes have been burned, rotate them out, and stay below the detection threshold. I don’t have a great mitigation for this. The data is by definition observable - it’s literally what the attackers are sending - so withholding it from defenders doesn’t withhold it from attackers, who already have it. But I won’t pretend the asymmetry is zero. It isn’t.

The second risk is poisoning. Anyone who notices my sensor’s IP range can send crafted traffic to make innocent IPs look malicious, or to make their own infrastructure look clean by flooding the sensor with benign traffic from neighboring IPs. I run multiple sensors in multiple ASNs and require a source to hit at least three of them before I flag it, which raises the cost of poisoning without eliminating it. A determined adversary with cloud budget can still poison. The countermeasure there is downstream: consumers of the data should treat it as one signal, not a verdict.
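The three-sensor rule can be sketched as a small aggregation in Python. The `Sighting` type and the function name are hypothetical, and this version counts distinct sensors only; requiring the three sensors to sit in different ASNs would be a stricter variant that the text does not state.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set

MIN_SENSORS = 3  # threshold from the text: at least three sensors


class Sighting(NamedTuple):
    source_ip: str   # address that sent the probe
    sensor_id: str   # which sensor observed it


def flagged_sources(
    sightings: Iterable[Sighting], min_sensors: int = MIN_SENSORS
) -> Set[str]:
    """Return sources seen by at least `min_sensors` distinct sensors."""
    sensors_by_source = defaultdict(set)
    for s in sightings:
        # A set ignores repeat hits on the same sensor, so hammering
        # one sensor never reaches the threshold on its own.
        sensors_by_source[s.source_ip].add(s.sensor_id)
    return {
        ip
        for ip, sensors in sensors_by_source.items()
        if len(sensors) >= min_sensors
    }
```

Counting distinct sensors rather than raw hits is the point: an adversary who has found one sensor cannot frame an address by volume alone and has to locate and reach several vantage points.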

The third risk is privacy. Probe data sometimes includes credentials the attacker is trying. Those credentials may have come from a real breach, which means they’re real usernames and real passwords belonging to real people. I redact passwords entirely and hash usernames before storage. I don’t expose either through the API. This is the kind of decision an open project has to make publicly and defend in writing, because if you get it wrong you’ve built a credential lookup service for the same attackers you’re trying to stop.
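A sketch of that sanitization step, under stated assumptions. The text says only that passwords are redacted and usernames are hashed; the keyed HMAC-SHA-256 used here is a swapped-in choice for illustration, picked because an unkeyed hash of a username is easy to reverse with a dictionary. The function and field names are hypothetical.

```python
import hashlib
import hmac


def sanitize_attempt(attempt: dict, key: bytes) -> dict:
    """Return the version of a login attempt that is safe to store.

    The password is dropped entirely and the username is replaced by a
    keyed hash, so neither plaintext value reaches storage.
    """
    # Copy everything except the two credential fields.
    stored = {
        k: v for k, v in attempt.items() if k not in ("username", "password")
    }
    # Assumption: HMAC-SHA-256 with a secret held by the operator.
    stored["username_hash"] = hmac.new(
        key, attempt["username"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return stored
```

The same username always maps to the same digest under a given key, so repeated attempts against one account can still be correlated without the plaintext ever being kept.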

How to use it without shooting yourself in the foot

If you’re going to integrate a probe database - mine or anyone else’s - into your security pipeline, three rules.

First, never block solely on a public reputation signal. Use it as a weighting factor in a multi-signal decision. A request from a flagged IP plus an anomalous payload is a block. A flagged IP alone is a logged warning. The difference between those two policies is the difference between a working SOC and a customer support nightmare.

Second, log the queries you make. If the database is wrong about an IP and you blocked a customer, you need to be able to reconstruct that decision later. “The API told me to” is not a defensible audit trail unless you have the actual API response stored alongside the block event.

Third, if you’re using the MCP integration with a language model, do not let the model take blocking actions directly. Let it enrich, summarize, and recommend. Keep a human or a deterministic rule in the loop for any action that touches production traffic. Models are good at synthesizing context. They are not yet good at deciding which signals deserve a 200 OK and which deserve a 403.
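The first two rules can be sketched together in Python. Everything named here (`Signals`, `decide`, `audit_record`, the action strings) is hypothetical, and the case of an anomalous payload from an unflagged IP is not covered by the rules above, so this sketch leaves it to other local detections.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Signals:
    ip_flagged: bool         # the public reputation signal
    payload_anomalous: bool  # a local signal from your own detection


def decide(signals: Signals) -> str:
    """Deterministic policy: reputation alone never blocks."""
    if signals.ip_flagged and signals.payload_anomalous:
        return "block"
    if signals.ip_flagged:
        return "log_warning"
    # Not covered by the rules in the text; other local rules may still act.
    return "no_action"


def audit_record(
    source_ip: str,
    signals: Signals,
    raw_api_response: str,
    decided_at: datetime,
) -> str:
    """Serialize the decision with the exact API response that informed it.

    `decided_at` must be timezone-aware.
    """
    return json.dumps(
        {
            "decided_at": decided_at.astimezone(timezone.utc).isoformat(),
            "source_ip": source_ip,
            "decision": decide(signals),
            "signals": asdict(signals),
            "api_response": raw_api_response,  # stored verbatim for later review
        },
        sort_keys=True,
    )
```

Because `decide` is a plain function of its inputs, a model's output can feed it only as a recommendation; the action on production traffic still comes from the rule, and the stored record shows which response the rule saw.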

What the data actually shows

A few patterns from the first month that surprised me.

Residential ISPs account for more probes than I expected - roughly 18% of distinct source IPs, against cloud providers at 61%. Compromised home routers and IoT devices are doing serious work for attackers. If your blocklist excludes residential ranges to avoid customer pain, you have a blind spot.

The median time between a CVE being published and probes for that CVE hitting my sensors is under 48 hours. For CVEs with public PoCs, it’s under 12 hours. Patch windows that assume you have a week are wrong.

Most credential stuffing traffic uses User-Agents that mimic mobile apps from popular services. Generic User-Agent blocklists will catch almost none of it. If your detection logic relies on UA strings, treat that detection as decorative.

Where to find it

The repo is open. The API is rate-limited but free. The MCP server config is in the README. If you run it, send me your sensor data - the value of the database scales with the number of vantage points, and a database with 50 sensors in 50 networks is meaningfully more useful than one with 5 sensors in one network.

If you find a problem with the data - a false positive, a poisoning attempt, a privacy leak I missed - file an issue. The whole point of doing this in the open is that you get to check my work.
