How we actually do this.

Researchers ask. Procurement teams ask. Journalists ask. This page answers: how Deepinfo collects internet-scale data, how we scan monitored assets, how we score risk, and where the ethical lines around our data work sit. If you have a question this page doesn't cover, the contact information at the bottom is real and we read it.

DATA COLLECTION

What we collect, and where it comes from.

Deepinfo operates an internet-scale dataset, updated continuously. Five corpora, each indexed independently and reconciled into a single queryable surface.

  • Domains: 400M+
  • Subdomains: 2B+
  • DNS records: 200B+
  • SSL certificates: 30B+
  • CVEs (with EPSS & KEV): 338K+

Sources

Exclusively the open internet: passive DNS, certificate transparency, public WHOIS, web crawling that respects robots.txt, NVD, CISA KEV, EPSS, public ASN and IP-Whois data.
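What "respects robots.txt" means in practice, as a minimal sketch using only the Python standard library (an illustration, not our production crawler; the user agent string is a placeholder):

    from urllib.parse import urlsplit
    from urllib.request import Request, urlopen
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "example-open-web-crawler/0.1"  # placeholder, not a real Deepinfo user agent

    def fetch_if_allowed(url: str, timeout: float = 10.0) -> bytes | None:
        # Read the site's robots.txt and only fetch the page if our agent is allowed.
        parts = urlsplit(url)
        robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()
        if not robots.can_fetch(USER_AGENT, url):
            return None  # the site disallows this path for our agent; skip it
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req, timeout=timeout) as resp:
            return resp.read()

    print(fetch_if_allowed("https://example.com/") is not None)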

What we don’t do

No scraping behind authentication. No aggregation of personal data beyond what's already public. No sale of personal data. No reuse of customer-tied findings to train shared models or to surface them to other customers.

Data residency

Customer data is primarily stored in US-based infrastructure. EU-based infrastructure is available for EU customers on request, with additional regional infrastructure in Türkiye and Qatar for region-specific deployments.

SCANNING METHODOLOGY

Seven layers per asset, on a continuous schedule.

Every monitored asset is scanned across seven independent data layers. Each runs on its own cadence; full historical state is preserved per layer. A minimal sketch of a few of these checks follows the list.

  01 Whois: Registration data, registrar, registrant scope.
  02 IP-Whois: Network ownership, ASN, allocated ranges.
  03 DNS: Live records (A, AAAA, MX, NS, TXT, CNAME).
  04 SSL: Certificate state, issuer, validity, configuration.
  05 Port scan: Exposed services and listening ports.
  06 HTTP: Response headers, redirects, security headers.
  07 Web data: Technology fingerprinting, login pages, screenshots.
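
For readers who want to see the shape of these checks, here is a minimal sketch of the DNS, SSL, and HTTP layers for a single hostname, using only the Python standard library. It illustrates the kind of unauthenticated, read-only probing involved; it is not our production scanner, and the user agent is a placeholder.

    import json
    import socket
    import ssl
    from urllib.request import Request, urlopen

    USER_AGENT = "example-exposure-scanner/0.1"  # placeholder, not a real Deepinfo user agent

    def dns_layer(host: str) -> list[str]:
        # Layer 03: resolve the host's current A/AAAA addresses.
        infos = socket.getaddrinfo(host, 443, 0, socket.SOCK_STREAM)
        return sorted({info[4][0] for info in infos})

    def ssl_layer(host: str) -> dict:
        # Layer 04: read certificate issuer and validity window via a normal TLS handshake.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, 443), timeout=5) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return {"issuer": cert.get("issuer"), "notAfter": cert.get("notAfter")}

    def http_layer(host: str) -> dict:
        # Layer 06: collect response and security headers; no authentication, no payloads.
        req = Request(f"https://{host}/", headers={"User-Agent": USER_AGENT})
        with urlopen(req, timeout=5) as resp:
            return {"status": resp.status, "headers": dict(resp.headers)}

    if __name__ == "__main__":
        host = "example.com"
        snapshot = {"dns": dns_layer(host), "ssl": ssl_layer(host), "http": http_layer(host)}
        print(json.dumps(snapshot, indent=2, default=str))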

Cadence

DNS scans run more frequently, port scans less frequently. Each layer runs on Deepinfo's continuous schedule, not the customer's. When something changes between consecutive scans of a layer, the platform surfaces the delta as an event.
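
"Surfaces the delta as an event" is easier to show than to describe. A minimal sketch of the idea, with an assumed snapshot shape (this is illustrative, not our event schema):

    def diff_snapshots(previous: dict, current: dict, layer: str) -> list[dict]:
        # Emit one event per field whose value changed between two scans of the same layer.
        events = []
        for field in sorted(set(previous) | set(current)):
            old, new = previous.get(field), current.get(field)
            if old != new:
                events.append({"layer": layer, "field": field, "old": old, "new": new})
        return events

    # Example: between two DNS scans, the MX record moved and an SPF record appeared.
    prev = {"MX": ["10 mail.example.com"], "TXT": []}
    curr = {"MX": ["10 mx.other-provider.example"], "TXT": ["v=spf1 -all"]}
    print(diff_snapshots(prev, curr, layer="dns"))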

Non-intrusive

No authenticated scans, no exploit attempts, no load tests as part of standard monitoring. Scanning is conducted from Deepinfo-owned infrastructure with documented user agents.

Verifiable

Customers can verify that our scanning is non-intrusive by inspecting the requests we send to their assets. We don't intentionally evade detection by target hosts.
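
One way to do that inspection, as a sketch: group the requests in your own access log by user agent and see how each scanner identifies itself. This assumes the common "combined" log format and an example log path; adjust both for your server.

    import re
    from collections import Counter

    # Combined format: ip ident user [time] "request" status bytes "referer" "user-agent"
    LINE = re.compile(
        r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" \S+ \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
    )

    def user_agents(log_path: str) -> Counter:
        # Count requests per user agent so any scanner that identifies itself stands out.
        counts: Counter = Counter()
        with open(log_path) as fh:
            for line in fh:
                match = LINE.match(line)
                if match:
                    counts[match.group("ua")] += 1
        return counts

    for ua, hits in user_agents("/var/log/nginx/access.log").most_common(20):  # example path
        print(hits, ua)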

SCORING AND PRIORITIZATION

CVSS, EPSS, KEV. And the math underneath.

Per-asset and per-domain risk scores combine the issues found across the seven scanning layers, weighted by severity and by real-world exploitation signal. The unified scale lets a customer compare across their inventory and across vendor portfolios on consistent math.

Vulnerability scoring goes beyond CVSS. Every CVE detected is enriched with EPSS, the Exploit Prediction Scoring System, which models the probability of exploitation in the next 30 days based on real-world attack data. Every CVE is also flagged against CISA's Known Exploited Vulnerabilities catalog. The score weights real-world signal over theoretical severity, so a high-CVSS finding with low EPSS and no KEV listing genuinely ranks below a medium-CVSS finding with high EPSS and an active KEV listing.
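
Both signals come from public feeds, so the enrichment step is easy to reproduce. A minimal sketch against FIRST's EPSS API and CISA's published KEV catalog (endpoint paths and field names are as published at the time of writing; this is not a Deepinfo API):

    import json
    from urllib.request import urlopen

    EPSS_API = "https://api.first.org/data/v1/epss?cve="
    KEV_FEED = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"

    def enrich(cve_id: str) -> dict:
        # EPSS: probability of exploitation in the next 30 days, per FIRST's model.
        with urlopen(EPSS_API + cve_id, timeout=15) as resp:
            epss_rows = json.load(resp).get("data", [])
        epss = float(epss_rows[0]["epss"]) if epss_rows else 0.0

        # KEV: is this CVE in CISA's Known Exploited Vulnerabilities catalog?
        with urlopen(KEV_FEED, timeout=30) as resp:
            kev_ids = {v["cveID"] for v in json.load(resp).get("vulnerabilities", [])}

        return {"cve": cve_id, "epss": epss, "kev": cve_id in kev_ids}

    print(enrich("CVE-2021-44228"))  # Log4Shell: high EPSS, listed in KEV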

Scores are deterministic given a snapshot of findings: the same findings always produce the same score. We document the scoring rubric for customers who want to understand it; we don't treat the rubric as proprietary mystery.
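
To make the ranking claim concrete, here is a toy weighting. The weights and formula below are ours for the example only, not Deepinfo's documented rubric; they show how exploitation signal can outrank raw severity and why a score computed this way is deterministic.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Finding:
        cve: str
        cvss: float   # 0.0 to 10.0, theoretical severity
        epss: float   # 0.0 to 1.0, probability of exploitation within 30 days
        kev: bool     # listed in CISA's Known Exploited Vulnerabilities catalog

    def priority(f: Finding) -> float:
        # Deterministic: the same finding always yields the same score.
        return round(0.3 * (f.cvss / 10.0) + 0.5 * f.epss + (0.2 if f.kev else 0.0), 4)

    findings = [
        Finding("CVE-0000-0001", cvss=9.8, epss=0.02, kev=False),  # placeholder ID; severe on paper
        Finding("CVE-0000-0002", cvss=6.5, epss=0.91, kev=True),   # placeholder ID; exploited in the wild
    ]
    for f in sorted(findings, key=priority, reverse=True):
        print(f.cve, priority(f))
    # The medium-CVSS, high-EPSS, KEV-listed finding ranks first (0.85 vs. 0.304).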

DATA ETHICS

The lines we draw, and why.

Internet-scale data work has ethical edges. We've drawn ours explicitly:

  • No personal data sold. Aggregate dataset access doesn't include personal data outside what's public in WHOIS or surface-web indexing. We don't broker personal data.
  • No scraping behind authentication. We don't bypass paywalls, defeat anti-scraping measures, or access content gated by accounts. The dataset is built from the open internet.
  • Customer data stays customer data. Customer-discovered assets and findings are tied to the customer account. We don't surface them to other customers, sell them, or train shared models on them.
  • Dark web sourcing is industry-standard. Dark web data, where included in our coverage, comes from established sourcing patterns. We don't pay for newly stolen data; we don't position ourselves as a buyer of breach materials.

These lines aren't marketing. If we ever change them, we'll document the change publicly with rationale. Researchers and procurement teams can confirm specifics by talking to us.

TRANSPARENCY

If you need more detail than this page covers.

This page is a starting point. Researchers, journalists, procurement teams, and academic users sometimes need depth this page doesn't reach: specific scanning frequency for a given layer, specific data-source attribution for a given dataset, specific verification of an ethical commitment.

Reach us through one of these channels:

  • General questions: [email protected]
  • Research access and methodology consultation: through the Researcher Program
  • Procurement-team detail: through your sales contact or Talk to us
  • Security disclosures (issues with our infrastructure): via the security disclosure surface (see Trust)

GET STARTED

See the methodology applied to your domain.

The free threat exposure report runs the methodology against your domain end-to-end. Discovery, scanning, scoring, and prioritization in one report.

Get a free threat exposure report · Talk to us