How we actually do this.
Researchers ask. Procurement teams ask. Journalists ask. This page answers: how Deepinfo collects internet-scale data, how we scan monitored assets, how we score risk, and where the ethical lines around our data work sit. If you have a question this page doesn't cover, the contact information at the bottom is real and we read it.
What we collect, and where it comes from.
Deepinfo operates an internet-scale dataset, updated continuously. Five corpora, each indexed independently and reconciled into a single queryable surface.
Exclusively the open internet: passive DNS, certificate transparency logs, public WHOIS, web crawling that respects robots.txt, NVD, CISA KEV, EPSS, and public ASN and IP WHOIS data.
No scraping behind authentication. No aggregation of personal data beyond what’s already public. No sale of personal data. No reuse of customer-tied findings to train shared models or surface to other customers.
Customer data is primarily stored in US-based infrastructure. EU-based infrastructure is available for EU customers on request, with additional regional infrastructure in Türkiye and Qatar for region-specific deployments.
Seven layers per asset, on a continuous schedule.
Every monitored asset is scanned across seven independent data layers. Each runs on its own cadence; full historical state is preserved per layer.
- WHOIS: registration data, registrar, registrant scope.
- IP/ASN: network ownership, ASN, allocated ranges.
- DNS: live records (A, AAAA, MX, NS, TXT, CNAME).
- SSL/TLS: certificate state, issuer, validity, configuration.
- Ports: exposed services and listening ports.
- HTTP: response headers, redirects, security headers.
- Web: technology fingerprinting, login pages, screenshots.
DNS scans run more frequently, port scans less frequently. Each layer runs on Deepinfo's continuous schedule, not the customer's. When something changes, the platform surfaces the delta as an event.
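The delta-as-event idea can be sketched in a few lines. This is an illustrative sketch only: the snapshot shape and field names are assumptions, not Deepinfo's actual schema. Each layer's latest snapshot is diffed against the previous one, and only the changes surface.

```python
# Hypothetical sketch of delta detection between two scans of one layer.
# The dict shape and field names are illustrative, not Deepinfo's schema.

def diff_layer(previous: dict, current: dict) -> list[dict]:
    """Compare two snapshots of one scanning layer and emit change events."""
    events = []
    for key in previous.keys() | current.keys():
        before, after = previous.get(key), current.get(key)
        if before != after:
            events.append({"field": key, "before": before, "after": after})
    return events

# Example: the DNS layer for one asset between two scheduled runs.
yesterday = {"A": ["203.0.113.10"], "MX": ["mail.example.com"]}
today = {"A": ["203.0.113.10", "198.51.100.7"], "MX": ["mail.example.com"]}

for event in diff_layer(yesterday, today):
    print(event)  # only the changed A record surfaces; the unchanged MX does not
```

The point is the monitoring model, not the diff itself: an unchanged layer produces no events, so customers see change, not noise.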
No authenticated scans, no exploit attempts, no load tests as part of standard monitoring. Scanning is conducted from Deepinfo-owned infrastructure with documented user agents.
Customers whose assets we monitor can verify that our scanning is non-intrusive by inspecting the requests themselves. We don't intentionally evade detection by target hosts.
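Inspecting those requests can be as simple as filtering a standard access log by user agent. A hedged sketch: "DeepinfoBot" below is a placeholder string, not Deepinfo's documented user agent; confirm the real value with Deepinfo before filtering on it.

```python
# Sketch: pull scanner requests out of a combined-format access log.
# "DeepinfoBot" is a PLACEHOLDER user agent, not Deepinfo's documented one.
import re

LOG_LINE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def scanner_requests(log_lines, agent_substring="DeepinfoBot"):
    """Yield (method, path, status) for requests whose user agent matches."""
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and agent_substring in m.group("agent"):
            yield m.group("method"), m.group("path"), m.group("status")

sample = [
    '198.51.100.7 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "DeepinfoBot/1.0 (placeholder)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "POST /login HTTP/1.1" 403 128 "-" "Mozilla/5.0"',
]
for req in scanner_requests(sample):
    print(req)  # only the scanner's GET surfaces; other traffic is ignored
```

What to look for in the output: read-only methods, no authentication attempts, no payloads probing for exploits.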
CVSS, EPSS, KEV. And the math underneath.
Per-asset and per-domain risk scores combine the issues found across the seven scanning layers, weighted by severity and by real-world exploitation signal. The unified scale lets a customer compare across their inventory and across vendor portfolios on consistent math.
Vulnerability scoring goes beyond CVSS. Every CVE detected is enriched with EPSS, the Exploit Prediction Scoring System, which models the probability of exploitation in the next 30 days based on real-world attack data. Every CVE is also flagged against CISA's Known Exploited Vulnerabilities catalog. The score weights real-world signal over theoretical severity, so a high-CVSS CVE with low EPSS and no KEV entry genuinely ranks below a medium-CVSS CVE with high EPSS and an active KEV listing.
Scores are deterministic given a snapshot of findings: the same findings always produce the same score. We document the scoring rubric for customers who want to understand it; we don't treat the rubric as proprietary mystery.
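The ranking property described above can be made concrete with a toy blend. This is a minimal sketch, not Deepinfo's actual rubric: the weights and the formula are illustrative assumptions. It shows the two documented properties: exploitation signal (EPSS, KEV) outweighs theoretical severity (CVSS), and the same inputs always produce the same score.

```python
# Illustrative prioritization blend. The weights (30/50/20) are assumptions,
# NOT Deepinfo's rubric; they exist only to demonstrate the ranking property.

def priority(cvss: float, epss: float, kev: bool) -> float:
    """Blend theoretical severity with exploitation signal on a 0-100 scale."""
    base = (cvss / 10.0) * 30      # CVSS contributes at most 30 points
    exploitation = epss * 50       # 30-day exploitation probability, up to 50
    kev_bonus = 20 if kev else 0   # actively exploited per CISA KEV, flat 20
    return round(base + exploitation + kev_bonus, 1)

# High CVSS, but little evidence anyone exploits it:
theoretical = priority(cvss=9.8, epss=0.01, kev=False)   # 29.9
# Medium CVSS, but high exploitation probability and in the KEV catalog:
exploited = priority(cvss=6.5, epss=0.92, kev=True)      # 85.5

assert exploited > theoretical  # real-world signal wins
```

Because the function is pure, the score is deterministic given a snapshot of findings, which is the property the paragraph above commits to.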
The lines we draw, and why.
Internet-scale data work has ethical edges. We've drawn ours explicitly:
- No personal data sold. Aggregate dataset access doesn't include personal data outside what's public in WHOIS or surface-web indexing. We don't broker personal data.
- No scraping behind authentication. We don't bypass paywalls, defeat anti-scraping measures, or access content gated by accounts. The dataset is built from the open internet.
- Customer data stays customer data. Customer-discovered assets and findings are tied to the customer account. We don't surface them to other customers, sell them, or train shared models on them.
- Dark web sourcing is industry-standard. Dark web data, where included in our coverage, comes from established sourcing patterns. We don't pay for newly stolen data; we don't position ourselves as a buyer of breach materials.
These lines aren't marketing. If we ever change them, we'll document the change publicly with rationale. Researchers and procurement teams can confirm specifics by talking to us.
If you need more detail than this page covers.
This page is a starting point. Researchers, journalists, procurement teams, and academic users sometimes need depth this page doesn't reach: specific scanning frequency for a given layer, specific data-source attribution for a given dataset, specific verification of an ethical commitment.
Reach us through one of these channels:
- General questions: [email protected]
- Research access and methodology consultation: through the Researcher Program
- Procurement-team detail: through your sales contact or Talk to us
- Security disclosures (issues with our infrastructure): via the security disclosure surface (see Trust)
See the methodology applied to your domain.
The free threat exposure report runs the methodology against your domain end-to-end. Discovery, scanning, scoring, and prioritization in one report.