
Passive Recon Pipeline: How the Assessment Actually Works

3/6/2026 12 min

If you read the Wolf version of this post, you got the what: passive recon, evidence of active targeting, shadow IT, a dangling subdomain, data leakage, EOL software. All real, all from a recent engagement. This post is for practitioners — the people who want to know how.

The pipeline I’m going to describe is composable, repeatable, and built entirely from open-source tooling. It’s designed to simulate what a motivated attacker does before they touch anything.

The Philosophy First

As I covered in the full post: passive recon means zero interaction with the target’s infrastructure. No packets sent, no requests made, no rules tripped. You’re reading public records that have been accumulating since the first certificate was issued and the first URL was crawled. The question you’re answering is what an attacker learns before they do anything active. The answer is almost always more than the organization realizes.

The Pipeline

The assessment runs in two parallel branches that converge at a final correlation step.

                  ┌─ dns-enum ─── shodan-enrich ──┐
ct-recon ─────────┤                               │
                  └─ (active recon, phase 2)      │
                                                  ├─ asm-report
root domain ─┬─ gau ─── url-analyze ──────────────┤
             └─ theHarvester ─────────────────────┘

Infrastructure branch: CT logs discover subdomains, DNS resolves them, Shodan enriches the IPs with what it’s already observed from its continuous internet-wide scanning.

Application branch: GAU (GetAllURLs) pulls historical URLs from Wayback Machine, Common Crawl, OTX, and URLScan. A custom analyzer classifies and extracts intelligence from those URLs.

Both branches feed into a final correlation step that produces a unified inventory.

Certificate Transparency Logs

CT logs are the starting point. Every TLS certificate issued by a trusted CA is logged to public CT logs. It’s a requirement of the CA/Browser Forum. This means every subdomain that’s ever had a certificate issued for it is permanently in a public record, searchable by domain.

Tools like crt.sh expose a queryable interface to these logs. A simple query against a root domain returns every certificate ever issued, including SANs, which often contain subdomains that don’t appear in DNS at all: preprod environments, internal tooling that someone issued a cert for, staging systems.
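The extraction step can be sketched in a few lines of Python. The query itself is a single GET against crt.sh (e.g. https://crt.sh/?q=%.example.com&output=json); here the rows are inlined sample data so the parsing logic stands alone, and example.com is a placeholder domain:

```python
def extract_subdomains(crtsh_rows, root_domain):
    """Pull unique subdomains out of crt.sh-style JSON rows.

    Each row's name_value field may hold several newline-separated
    names, including wildcard entries like *.staging.example.com.
    """
    found = set()
    for row in crtsh_rows:
        for name in row.get("name_value", "").splitlines():
            name = name.strip().lower().lstrip("*.")  # drop wildcard prefix
            if name.endswith(root_domain):
                found.add(name)
    return sorted(found)

# Sample rows shaped like crt.sh's JSON output (hypothetical data);
# a live run would fetch https://crt.sh/?q=%.example.com&output=json
sample = [
    {"name_value": "www.example.com\npreprod.example.com"},
    {"name_value": "*.staging.example.com"},
    {"name_value": "vpn.example.com"},
]

print(extract_subdomains(sample, "example.com"))
# ['preprod.example.com', 'staging.example.com', 'vpn.example.com', 'www.example.com']
```

Deduplication matters here: the same subdomain tends to appear across dozens of certificate renewals, so the raw row count is much larger than the name count.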

What you’re building here is a historical record of everything the organization has ever put a certificate on. Some of it will be long dead. All of it is signal.

DNS Enumeration

With a list of candidate subdomains from CT logs, the next step is DNS resolution. This tells you what’s actually alive right now: which subdomains resolve, to what IPs, through what CNAME chains.

CNAME chains are particularly interesting because they reveal vendor relationships. A subdomain that CNAMEs to a third-party service is telling you: this organization trusts this vendor enough to delegate DNS to them. Chain enough of those and you have a vendor inventory the organization probably doesn’t have themselves.
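A minimal sketch of the chain-walking logic, with a dict standing in for a live resolver (in practice you’d back the lookup with a library like dnspython); all of the names below are hypothetical:

```python
def follow_cname_chain(name, lookup, max_depth=10):
    """Follow a CNAME chain until it terminates in an A record or dies.

    lookup(name) -> ("CNAME", target) | ("A", ip) | None
    Returns (chain_of_names, terminal_ip_or_None).
    """
    chain = [name]
    for _ in range(max_depth):          # bounded: guards against CNAME loops
        record = lookup(chain[-1])
        if record is None:
            return chain, None          # dangling: worth a closer look
        rtype, value = record
        if rtype == "A":
            return chain, value
        chain.append(value)             # CNAME: keep walking
    return chain, None

# Mock zone data standing in for live DNS (hypothetical names).
zone = {
    "shop.example.com": ("CNAME", "example.myshopify.com"),
    "docs.example.com": ("CNAME", "example.github.io"),
    "example.myshopify.com": ("A", "23.227.38.65"),
    # example.github.io intentionally absent: a dead vendor CNAME
}

chain, ip = follow_cname_chain("shop.example.com", zone.get)
print(chain, ip)  # ['shop.example.com', 'example.myshopify.com'] 23.227.38.65
chain, ip = follow_cname_chain("docs.example.com", zone.get)
print(ip)         # None -> dangling record
```

Every intermediate name in the chain is a data point: each external hop is a vendor the organization depends on.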

Cloudflare-proxied IPs are worth flagging separately. If a subdomain resolves to a CF edge IP, you’re not seeing the origin: the real IP is hidden. That’s a WAF coverage indicator, but it also means there’s something behind it you can’t see directly from DNS.

Subdomain takeover candidates surface here too. The detection is mechanical: if a CNAME points to an external provider and that provider returns a “not found” or “available” response for the target resource, the subdomain is claimable. The hard part isn’t finding them, it’s that DNS records for dead vendor relationships often outlive anyone’s awareness that the relationship ended.
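The mechanical check looks something like this. The fingerprints below are illustrative samples, not a complete list (projects like can-i-take-over-xyz maintain real, provider-specific ones), and in a strictly passive phase the response body would come from archived scan data rather than a live request:

```python
# Illustrative provider fingerprints (real lists are longer and
# maintained per-provider; these are samples for the sketch).
FINGERPRINTS = {
    "github.io": "There isn't a GitHub Pages site here",
    "s3.amazonaws.com": "NoSuchBucket",
    "herokuapp.com": "No such app",
}

def takeover_candidate(cname_target, response_body):
    """Flag a CNAME whose provider says the target resource doesn't exist."""
    for provider_suffix, marker in FINGERPRINTS.items():
        if cname_target.endswith(provider_suffix) and marker in response_body:
            return True
    return False

print(takeover_candidate("old-docs.github.io",
                         "404: There isn't a GitHub Pages site here."))  # True
```

The suffix match matters: the fingerprint string alone isn’t enough, because the same error text can appear in unrelated page bodies.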

Shodan Enrichment

Once you have a list of unique IPs from DNS, Shodan is where things get interesting. Shodan continuously scans the internet and catalogues what it finds: open ports, service banners, TLS certificate details, HTTP response headers. By the time you query it, it’s likely already scanned every IP in your list multiple times.

What you’re looking for:

  • Open ports that shouldn’t be open. SSH, RDP, database ports, admin interfaces reachable from the public internet.
  • Default or self-signed certificates. Indicates infrastructure that was stood up without proper TLS configuration, often shadow IT or forgotten test environments.
  • EOL software. Shodan captures version strings. An nginx/1.14.0 banner on a production-adjacent server is both an EOL finding and a CVE surface.
  • Known CVEs. Shodan’s vulnerability data cross-references service versions against CVE databases. It’ll tell you if a service version has documented exploits.
  • Default pages. An IP serving a default web server welcome page is infrastructure that was stood up and left running. It’s likely unmonitored and unpatched.
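A triage pass over Shodan’s host output reduces to a few lookups. The field names below (ip_str, ports, vulns, data) mirror the shape of Shodan’s host API response; the risky-port map and EOL version prefixes are illustrative, not exhaustive:

```python
RISKY_PORTS = {22: "SSH", 3389: "RDP", 3306: "MySQL", 5432: "PostgreSQL"}
EOL_PREFIXES = {"nginx": ("1.14",), "Apache httpd": ("2.2",)}  # illustrative

def triage_host(host):
    """Turn one Shodan-style host record into a flat list of findings."""
    findings = []
    ip = host["ip_str"]
    for port in host.get("ports", []):
        if port in RISKY_PORTS:
            findings.append(f"open:{RISKY_PORTS[port]}:{ip}:{port}")
    for cve in sorted(host.get("vulns", [])):
        findings.append(f"cve:{cve}:{ip}")
    for svc in host.get("data", []):  # one entry per observed service banner
        prod, ver = svc.get("product", ""), svc.get("version", "")
        if any(ver.startswith(p) for p in EOL_PREFIXES.get(prod, ())):
            findings.append(f"eol:{prod}/{ver}:{ip}")
    return findings

host = {  # sample record shaped like Shodan's host output
    "ip_str": "203.0.113.10",
    "ports": [443, 3389],
    "vulns": ["CVE-2019-0708"],
    "data": [{"product": "nginx", "version": "1.14.0"}],
}
for finding in triage_host(host):
    print(finding)
```

Run against the full IP list, this produces the raw findings feed that the correlation step later folds into the inventory.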

The key insight here is that Shodan already has this data. You’re not scanning anything. You’re reading a database of scans that happened before you started.

URL Intelligence

GAU aggregates historical URLs from multiple sources: the Wayback Machine, Common Crawl, OTX, and URLScan.io. For a mature organization, this produces tens or hundreds of thousands of URLs spanning years of crawl history.

Raw URL lists are noise. The value is in classification:

  • Auth and session endpoints: login flows, token endpoints, OAuth callbacks. These tell you how authentication works and where to look for weaknesses.
  • API endpoints: patterns like /api/v1/, /graphql, /rest/ reveal the API surface. Version numbers in paths tell you about API lifecycle management (or lack thereof).
  • Sensitive file types: .env, .sql, .bak, .config in URL paths. Most won’t return anything useful, but they tell you what the organization’s deployment hygiene looked like historically.
  • Parameters with sensitive patterns: email addresses, tokens, IDs in query strings. This is where PII leakage lives. If an application ever put a user’s email address in a URL parameter, it’s in the crawl databases forever.
  • Non-production hosts: URLs from staging, dev, preprod, sandbox environments that got indexed. Shadow IT discovery from the application layer instead of the infrastructure layer.
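A classifier for this doesn’t need to be clever; a handful of regex buckets covers most of it. A sketch with illustrative patterns, not a production ruleset:

```python
import re
from urllib.parse import urlparse, parse_qs

# Pattern buckets mirroring the classes above (regexes are illustrative).
PATH_RULES = [
    ("auth",      re.compile(r"/(login|logout|oauth|token|sso)\b", re.I)),
    ("api",       re.compile(r"/(api/v\d+|graphql|rest)\b", re.I)),
    ("sensitive", re.compile(r"\.(env|sql|bak|config)$", re.I)),
]
NON_PROD = re.compile(r"^(staging|dev|preprod|sandbox)\.", re.I)
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.I)

def classify(url):
    """Return the set of classes a historical URL falls into."""
    parsed = urlparse(url)
    tags = {tag for tag, rx in PATH_RULES if rx.search(parsed.path)}
    if NON_PROD.search(parsed.hostname or ""):
        tags.add("non-prod")
    # An email address in a query parameter is potential PII leakage.
    for values in parse_qs(parsed.query).values():
        if any(EMAIL.fullmatch(v) for v in values):
            tags.add("pii-param")
    return tags

print(classify("https://staging.example.com/api/v1/users?email=jane@example.com"))
```

That single hypothetical URL lands in three buckets at once (api, non-prod, pii-param), which is exactly the kind of overlap that makes a raw list worth classifying.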

The URL archive is effectively a historical record of every mistake the application has ever made in how it constructs URLs. Some of those mistakes are fixed. The record of them isn’t.

OSINT Aggregation

theHarvester pulls email addresses, subdomains, IPs, and employee names from a broad range of OSINT sources: search engines, LinkedIn, GitHub, certificate data. This fills in the picture that the technical pipeline misses.

Employee email patterns matter because they’re the credential format attackers use in phishing and credential stuffing. A discovered pattern of firstname.lastname@company.com is intelligence. GitHub user associations can reveal contractors, third-party developers, and, occasionally, credentials or internal tooling that got committed to a public repo.
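Inferring the pattern from a harvested sample is a simple voting exercise. A sketch with hypothetical names and a deliberately short template list:

```python
from collections import Counter

# Common local-part conventions (illustrative, not exhaustive).
TEMPLATES = {
    "first.last": "{first}.{last}",
    "flast":      "{first_initial}{last}",
    "firstlast":  "{first}{last}",
}

def infer_email_pattern(samples):
    """samples: iterable of (first, last, email) tuples.

    Returns the most commonly matching template name, or None.
    """
    votes = Counter()
    for first, last, email in samples:
        local = email.split("@")[0].lower()
        fields = {"first": first.lower(), "last": last.lower(),
                  "first_initial": first[0].lower()}
        for name, template in TEMPLATES.items():
            if template.format(**fields) == local:
                votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None

samples = [("Jane", "Doe", "jane.doe@example.com"),
           ("Raj", "Patel", "raj.patel@example.com"),
           ("Ana", "Silva", "asilva@example.com")]
print(infer_email_pattern(samples))  # first.last
```

Once the convention is known, any public employee roster converts directly into a credential-stuffing target list, which is why the pattern itself counts as a finding.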

Correlating the Data

All five sources converge into a single inventory: every subdomain with its discovery source(s), every IP with its associated hosts and Shodan findings, every vendor relationship inferred from DNS, every URL classification, every email address.

The correlation step is where findings that look insignificant in isolation become significant together. A subdomain that appears in CT logs but not in DNS is a ghost: it existed once. A subdomain that resolves to an IP that Shodan says is running EOL software with known CVEs is a finding. A CNAME pointing to a provider that shows “page not found” or “this domain is available” is a critical finding.
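The merge itself is mostly set operations. A compressed sketch of that scoring logic, with simplified inputs (one IP per name, CVEs as the only Shodan signal):

```python
def correlate(ct_subdomains, dns_records, shodan_vulns):
    """Merge per-source data into one inventory and score each host.

    ct_subdomains: set of names seen in CT logs
    dns_records:   {name: ip} for names that currently resolve
    shodan_vulns:  {ip: [CVE, ...]}
    """
    inventory = {}
    for name in ct_subdomains | set(dns_records):
        ip = dns_records.get(name)
        entry = {"sources": [], "ip": ip, "status": "info"}
        if name in ct_subdomains:
            entry["sources"].append("ct")
        if ip:
            entry["sources"].append("dns")
            if shodan_vulns.get(ip):
                entry["status"] = "finding"  # live host with known CVEs
        else:
            entry["status"] = "ghost"        # in CT logs, gone from DNS
        inventory[name] = entry
    return inventory

inv = correlate(
    {"www.example.com", "old.example.com"},
    {"www.example.com": "203.0.113.10"},
    {"203.0.113.10": ["CVE-2019-0708"]},
)
print(inv["old.example.com"]["status"])  # ghost
print(inv["www.example.com"]["status"])  # finding
```

The real correlation step carries more dimensions (URL classes, vendor CNAMEs, email data), but the shape is the same: union the keys, attach per-source evidence, escalate status when signals stack.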

Finding Evidence of Active Reconnaissance

This deserves its own section because it’s the most interesting part of this particular engagement.

URLScan.io is a public service where users submit URLs for analysis. The submissions are logged, timestamped, and publicly searchable. If someone is systematically submitting subdomains belonging to your client for analysis, that’s visible.

The pattern we found was consistent with automated target development: multiple subdomains submitted over a compressed timeframe, in a sequence that suggests enumeration rather than organic discovery. Not a human clicking links, but a tool working through a list. Cross-referencing the submission timestamps against CT log issuance timestamps and other public data points let us establish a timeline and confirm the pattern.
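Detecting that pattern reduces to burst analysis over the submission timestamps. A sketch with illustrative thresholds (the window and count here are placeholders, not the values from the engagement):

```python
from datetime import datetime, timedelta

def looks_automated(timestamps, hosts, window=timedelta(minutes=10),
                    threshold=5):
    """Flag a burst: many distinct hosts submitted inside one window.

    A human browsing organically rarely hits `threshold` different
    subdomains of one organization inside `window`; a tool working
    through a list does.
    """
    events = sorted(zip(timestamps, hosts))
    lo = 0
    for hi in range(len(events)):
        while events[hi][0] - events[lo][0] > window:
            lo += 1                       # slide window start forward
        hosts_in_window = {h for _, h in events[lo:hi + 1]}
        if len(hosts_in_window) >= threshold:
            return True
    return False

# Six distinct hypothetical subdomains submitted 30 seconds apart.
base = datetime(2026, 1, 15, 3, 0)
times = [base + timedelta(seconds=30 * i) for i in range(6)]
hosts = [f"app{i}.example.com" for i in range(6)]
print(looks_automated(times, hosts))  # True
```

The same six submissions spread over a day would not trip the check, which is the distinction between organic lookups and enumeration.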

There’s no log source for “someone looked you up on URLScan.” It doesn’t show up in a SIEM. The only way to find it is to go look, which is the whole point.

What Phase 1 Doesn’t Cover

The passivity that makes this safe to run also defines its ceiling. Specifically:

  • You don’t know what’s running behind Cloudflare. You see the edge, not the origin.
  • You don’t know what the application does, only what URLs it has used historically.
  • You can’t validate whether a vulnerability is actually exploitable, only that the conditions for it exist.
  • You have no view of internal infrastructure, authenticated surfaces, or anything that requires credentials.

Phase 2 (active recon: actually probing the infrastructure) and Phase 3 (vulnerability scanning and validation) go significantly deeper. Phase 1 gives you the map. Phases 2 and 3 tell you what’s on it.

Making It Repeatable

The value of a structured pipeline over ad-hoc tooling is repeatability. The same pipeline run six months later produces a delta: new subdomains, new IPs, new URL patterns, changed Shodan findings. That delta is your monitoring signal.
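The delta computation is the simplest code in the pipeline, which is rather the point. Sketched for the subdomain inventory with hypothetical names; the same shape works for IPs, URLs, and findings:

```python
def surface_delta(previous, current):
    """Diff two inventory snapshots (sets of subdomains) into the
    monitoring signal: what appeared and what disappeared."""
    return {
        "new":  sorted(current - previous),
        "gone": sorted(previous - current),
    }

last_run = {"www.example.com", "api.example.com", "old.example.com"}
this_run = {"www.example.com", "api.example.com", "beta.example.com"}
print(surface_delta(last_run, this_run))
# {'new': ['beta.example.com'], 'gone': ['old.example.com']}
```

Both halves of the delta matter: a new subdomain is new attack surface, and a vanished one may be a decommissioning that left dangling DNS behind.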

This is the direction the tooling is designed for: not just a one-time engagement artifact, but a continuously runnable assessment that answers “what changed in our external attack surface since last time?”

Most organizations don’t know the answer to that question. They should.

Want the Wolf version? Head back to the original post.
