Database Schema¶

Threat Loom uses SQLite with WAL (Write-Ahead Logging) mode for persistent storage. The database file is threatlandscape.db in the project root.

Tables¶

`sources`¶

Feed source definitions. Populated from config.json on startup.

Column	Type	Constraints	Description
`id`	INTEGER	PRIMARY KEY	Auto-incrementing source ID
`name`	TEXT	NOT NULL	Display name (e.g., "The Hacker News")
`url`	TEXT	NOT NULL, UNIQUE	Feed URL
`enabled`	INTEGER	DEFAULT 1	Whether the feed is active (1=yes, 0=no)
`last_fetched`	TIMESTAMP	—	When this source was last successfully fetched

`articles`¶

Ingested articles from all sources.

Column	Type	Constraints	Description
`id`	INTEGER	PRIMARY KEY	Auto-incrementing article ID
`source_id`	INTEGER	FOREIGN KEY → sources(id)	Which feed this article came from
`title`	TEXT	NOT NULL	Article headline
`url`	TEXT	NOT NULL, UNIQUE	Original article URL (used for deduplication)
`author`	TEXT	—	Article author
`published_date`	TIMESTAMP	—	When the article was published
`fetched_date`	TIMESTAMP	DEFAULT CURRENT_TIMESTAMP	When Threat Loom ingested the article
`content_raw`	TEXT	—	Scraped article text (populated by scraper)
`image_url`	TEXT	—	Thumbnail or hero image URL

Indexes

Index	Columns	Purpose
`idx_articles_url`	`url`	Fast deduplication lookups
`idx_articles_source`	`source_id`	Filter articles by source
`idx_articles_date`	`published_date DESC`	Chronological ordering

`summaries`¶

AI-generated article summaries, tags, and attack flows.

Column	Type	Constraints	Description
`id`	INTEGER	PRIMARY KEY	Auto-incrementing summary ID
`article_id`	INTEGER	UNIQUE, FOREIGN KEY → articles(id)	One summary per article
`summary_text`	TEXT	NOT NULL	Markdown-formatted summary (executive summary, details, mitigations)
`key_points`	TEXT	—	JSON string: attack flow steps or legacy bullet points
`tags`	TEXT	—	JSON array of categorization tags
`novelty_notes`	TEXT	—	What's novel about this threat (extracted from LLM response)
`model_used`	TEXT	—	OpenAI model that generated the summary (or "failed")
`created_date`	TIMESTAMP	DEFAULT CURRENT_TIMESTAMP	When the summary was generated

Indexes

Index	Columns	Purpose
`idx_summaries_article`	`article_id`	Fast article-to-summary lookup

Notes on key_points

This field stores a JSON string. For articles with attack flows, the format is:

[
  {
    "phase": "Initial Access",
    "title": "Spearphishing attachment",
    "description": "Attacker sends weaponized document...",
    "technique": "T1566.001"
  }
]

For legacy or non-attack articles, it may contain a simple array of strings.

`article_embeddings`¶

Vector embeddings for semantic search.

Column	Type	Constraints	Description
`article_id`	INTEGER	PRIMARY KEY, FOREIGN KEY → articles(id)	One embedding per article
`embedding`	BLOB	—	numpy float32 array (1536 dimensions × 4 bytes = 6,144 bytes per row)
`model_used`	TEXT	—	Embedding model (always `"text-embedding-3-small"`)
`created_date`	TIMESTAMP	DEFAULT CURRENT_TIMESTAMP	When the embedding was generated

BLOB Format

Embeddings are stored as raw bytes from numpy.array(floats, dtype=np.float32).tobytes() and read back with numpy.frombuffer(blob, dtype=np.float32). Each embedding is a 1536-dimensional float32 vector occupying 6,144 bytes.

`category_insights`¶

Cached trend analysis and forecasts per category.

Column	Type	Constraints	Description
`id`	INTEGER	PRIMARY KEY	Auto-incrementing insight ID
`category_name`	TEXT	NOT NULL, UNIQUE	Category key: `"Malware"` or `"Threat Actors::apt29"` for subcategories
`trend_text`	TEXT	—	Markdown trend analysis (3-6 paragraphs)
`forecast_text`	TEXT	—	Markdown forecast (2-4 paragraphs)
`article_count`	INTEGER	—	Number of articles used to generate the insight
`article_hash`	TEXT	—	SHA-256 hash (first 16 chars) of sorted article content hashes
`model_used`	TEXT	—	OpenAI model that generated the insight
`created_date`	TIMESTAMP	DEFAULT CURRENT_TIMESTAMP	When the insight was generated

Cache Key Format

Category-level: "Malware", "Vulnerabilities", etc.
Subcategory-level: "Threat Actors::apt29", "Malware::lockbit", etc.

Cache Invalidation

An insight is regenerated when:

The article_hash doesn't match the current hash (new articles added)
The created_date is older than 24 hours

`trend_analyses`¶

Cached historical trend analysis reports (quarterly and yearly) per category.

Column	Type	Constraints	Description
`id`	INTEGER	PRIMARY KEY	Auto-incrementing ID
`category_name`	TEXT	NOT NULL	Category key: `"Malware"` or `"Threat Actors::apt29"` for subcategories
`period_type`	TEXT	NOT NULL	`"quarterly"` or `"yearly"`
`period_label`	TEXT	NOT NULL	Quarter label (`"2024-Q1"`) or year (`"2024"`)
`trend_text`	TEXT	—	Markdown trend analysis for the period
`article_count`	INTEGER	—	Number of articles used to generate the analysis
`article_hash`	TEXT	—	SHA-256 hash (first 16 chars) of the articles used
`model_used`	TEXT	—	LLM model that generated the analysis
`created_date`	TIMESTAMP	DEFAULT CURRENT_TIMESTAMP	When the analysis was generated

Unique Constraint

UNIQUE(category_name, period_type, period_label) — one cached entry per category per period.

Cache Invalidation

A cached entry is regenerated when the article_hash doesn't match the current set of articles for that period. There is no TTL — trend analyses are considered stable once generated. When a time-period filter is active, results are generated fresh and not stored here.

`article_correlations`¶

Relationships between related articles.

Column	Type	Constraints	Description
`id`	INTEGER	PRIMARY KEY	Auto-incrementing correlation ID
`article_id_1`	INTEGER	FOREIGN KEY → articles(id)	First article in the pair
`article_id_2`	INTEGER	FOREIGN KEY → articles(id)	Second article in the pair
`correlation_type`	TEXT	—	Type of relationship
`confidence`	REAL	—	Confidence score (0.0-1.0)
`description`	TEXT	—	Human-readable relationship description
`created_date`	TIMESTAMP	DEFAULT CURRENT_TIMESTAMP	When the correlation was identified

Entity-Relationship Overview¶

sources 1──────────* articles
                       │
                       ├──────────1 summaries
                       │
                       ├──────────1 article_embeddings
                       │
                       └──*───────* article_correlations

category_insights  (standalone, keyed by category_name)
trend_analyses     (standalone, keyed by category_name + period_type + period_label)

Each source has many articles
Each article has at most one summary and one embedding
Article correlations link pairs of articles
Category insights are independent, keyed by category name string
Trend analyses are independent, keyed by category name + period type + period label

Design Decisions¶

WAL Mode¶

SQLite is configured with PRAGMA journal_mode=WAL, enabling concurrent reads while a write is in progress. This is important because the pipeline writes data while the Flask server serves read requests.

Thread-Local Connections¶

Each thread gets its own database connection to avoid SQLite's thread-safety limitations. This is essential for the multi-threaded architecture (Flask server + scheduler + scraper ThreadPoolExecutor).

BLOB Embeddings¶

Embeddings are stored as raw numpy byte arrays rather than using a dedicated vector database. This keeps the stack simple (no Pinecone, Chroma, or Faiss dependency) and works well for datasets under ~100,000 articles. Cosine similarity is computed in Python using numpy.

Strategic Indexes¶

Indexes are placed on columns used in frequent queries:

articles.url — Deduplication on every insert
articles.source_id — Source-based filtering
articles.published_date — Chronological sorting
summaries.article_id — Article-summary JOIN

Foreign Keys¶

Foreign keys are enabled via PRAGMA foreign_keys=ON. Deleting an article cascades to its summary, embedding, and correlations.

Article Deduplication¶

The articles.url column has a UNIQUE constraint. Duplicate URLs are silently skipped during ingestion (INSERT OR IGNORE pattern).