Snowflake Documentation
REACHDOG.TAXONOMY | 15 tables + 2 views | ~1.07B rows | ~18 GB compressed | Updated weekly
This dataset is a fully resolved intent-to-product graph spanning the English-language e-commerce landscape. 372 million products from 714,000 online stores, clustered into 29.6 million product neighborhoods, linked to 22.3 million search intent keywords with demand signals, and enriched across five semantic dimensions. Financial signals (pricing, competition, demand, opportunity) are available for 29.6 million clusters via the FINANCIAL_METRICS table, and per-keyword CPC/competition data is available for 53.6 million keywords via the KEYWORD_FINANCIALS table.
Where traditional product taxonomies organize products into a static tree (e.g., Apparel > Women > Dresses), this dataset builds a many-to-many semantic web that connects any search keyword to the products, stores, demographics, occasions, and price trends that surround it. Any node can reach any other node through structured, queryable paths.
Table Inventory
| Table | Rows | Size | Description |
|---|---|---|---|
| PRODUCT_BRIDGE | 361M | 2.2 GB | Product-to-cluster mapping (every product assigned to a neighborhood) |
| INTENT_PRODUCT_MATCHES | 223M | 3.9 GB | Keyword-to-cluster relevance scores (top matches per keyword) |
| WEEKLY_DIMENSIONS | 148M | 1.0 GB | Weekly breakdowns across 12 dimensions (5 lifestyle + 7 financial) |
| PRICE_HISTORY_EVENTS | 95M | 1.3 GB | Individual price change events since Feb 2024 |
| KEYWORD_FINANCIALS | 53.6M | 0.6 GB | Per-keyword CPC, competition, search volume |
| PRICE_HISTORY | 45M | 1.5 GB | Raw price history documents per product |
| MONTHLY_DIMENSIONS | 35M | 287 MB | Monthly aggregate breakdowns by dimension |
| INTENT_LINK_EDGES | 32M | 85 MB | Intent hierarchy parent-child edges |
| TITLE_CANONICALS | 29.6M | 4.7 GB | Cluster metadata: canonical title, size, enrichment tags |
| FINANCIAL_METRICS | 29.6M | 0.6 GB | 18 financial signals per cluster |
| INTENT_PRODUCTS | 22.3M | 786 MB | Master keyword table with search volume and categories |
| INTENT_LINKS | 1.6M | 401 MB | Intent hierarchy parent nodes (seed/canonical) |
| SHOPS | 714K | 1.0 GB | Active stores with enrichment tags and geographic data |
| WEEKLY_TOTALS | 59 | <1 MB | Platform-level weekly totals |
| MONTHLY_TOTALS | 14 | <1 MB | Platform-level monthly totals |
Views: CLUSTER_FULL joins TITLE_CANONICALS + FINANCIAL_METRICS into a single enriched cluster row. DIM_FINANCIAL_DIMENSIONS aggregates financial dimension tags for faceted browsing.
Entity Relationships
Join keys:
INTENT_PRODUCTS.KEYWORD_ID = INTENT_PRODUCT_MATCHES.KEYWORD_ID
INTENT_PRODUCT_MATCHES.CLUSTER_ID = TITLE_CANONICALS.CLUSTER_ID
TITLE_CANONICALS.CLUSTER_ID = PRODUCT_BRIDGE.CLUSTER_ID
PRODUCT_BRIDGE.PRODUCT_ID = PRICE_HISTORY_EVENTS.PRODUCT_ID
TITLE_CANONICALS.CLUSTER_ID = FINANCIAL_METRICS.CLUSTER_ID
INTENT_PRODUCTS
22,347,113 rows | The master list of every search intent keyword in the taxonomy.
| Column | Type | Description |
|---|---|---|
| KEYWORD_ID | INT | Unique identifier |
| KW | VARCHAR | The keyword text |
| TYPE | INT | 1=seed (1.0M), 2=canonical (2.1M), 3=intent (19.3M) |
| SV | INT | Measured search volume (Google/Amazon) |
| PARENT_SV | INT | Inherited search volume from parent keyword |
| CATS | VARIANT | Array of category labels |
Search volume note: SV is measured directly from Google Ads or Amazon. PARENT_SV is inherited from the parent seed/canonical keyword. 82% of all keywords carry one or both signals. Never sum or average SV and PARENT_SV together.
INTENT_PRODUCT_MATCHES
223,471,130 rows | Links every keyword to its closest product neighborhoods with a relevance score.
| Column | Type | Description |
|---|---|---|
| KEYWORD_ID | INT | FK to INTENT_PRODUCTS |
| CLUSTER_ID | INT | FK to TITLE_CANONICALS |
| SCORE | FLOAT | Semantic relevance score (higher = stronger match) |
| TITLE | VARCHAR | The matched cluster's canonical title |
| CLUSTER_SIZE | INT | Number of products in the cluster |
TITLE_CANONICALS
29,637,507 rows | Each row represents a product neighborhood with its canonical title and enrichment tags.
| Column | Type | Description |
|---|---|---|
| CLUSTER_ID | INT | Unique cluster identifier |
| SIZE | INT | Number of products in this cluster |
| CANONICAL | VARCHAR | Representative product title (max 125 chars) |
| TITLES | VARIANT | Array of sample titles from the cluster |
| ENRICHMENT_TAGS | VARIANT | Enrichment across 5+ dimensions (see below) |
| HAS_PRICE_HISTORY | BOOLEAN | Whether products in this cluster have price tracking |
3.6 million clusters enriched (12.2% of all clusters). Enriched clusters cover the highest-priority neighborhoods and represent 192 million of the 372 million total products. Enrichment is ongoing.
FINANCIAL_METRICS
29,637,507 rows | One row per cluster. Joins 1:1 with TITLE_CANONICALS on CLUSTER_ID. Use the CLUSTER_FULL view for a single enriched row.
| Column | Type | Coverage | Description |
|---|---|---|---|
| CLUSTER_ID | INT | 100% | FK to TITLE_CANONICALS |
| PRICE_MEDIAN | FLOAT | 99% | Median price across products in cluster |
| PRICE_MIN | FLOAT | 99% | Minimum observed price |
| PRICE_MAX | FLOAT | 99% | Maximum observed price |
| SELLER_COUNT | INT | 100% | Distinct sellers in this cluster |
| CPC | FLOAT | 17% | Cost-per-click estimate (USD) |
| COMPETITION | FLOAT | 17% | Google Ads competition score (0-1) |
| DEMAND_SCORE | FLOAT | 21% | Aggregated demand signal from matched keywords |
| OPPORTUNITY_SCORE | FLOAT | 21% | Composite: demand relative to supply and competition |
| OPPORTUNITY_TIERS | VARIANT | 23% | High / Medium / Low tier labels |
| INTENT_VOLUME_SCORE | NUMBER | 23% | Composite intent volume score |
| OPPORTUNITY_GAP_SCORE | FLOAT | 23% | Demand-supply gap score |
| PRICE_MEDIAN_13W | FLOAT | 4.7% | 13-week rolling median price |
| PRICE_MOMENTUM | FLOAT | 4.7% | Price momentum signal |
| DEMAND_VELOCITY | FLOAT | 0.9% | Rate of demand change |
KEYWORD_FINANCIALS
53,612,008 rows | Per-keyword financial data. Includes original 22.3M intent keywords plus 30.5M voice-search questions.
| Column | Type | Description |
|---|---|---|
| KID | INT | Unique keyword ID |
| KW | VARCHAR | Keyword or question text |
| TYPE | INT | 1=seed, 2=canonical, 3=intent/question |
| CID | INT | Cluster ID (FK to TITLE_CANONICALS) |
| COMPETITION_INDEX | FLOAT | Google Ads competition (0-100) |
| COMPETITION_LEVEL | VARCHAR | Low / Medium / High |
| CPC | FLOAT | Cost-per-click (USD) |
| CPC_HIGH | FLOAT | High-range CPC estimate |
| CPC_LOW | FLOAT | Low-range CPC estimate |
| SV | INT | Monthly search volume |
PRODUCT_BRIDGE
361,388,917 rows | Maps every individual product to its cluster neighborhood.
| Column | Type | Description |
|---|---|---|
| PRODUCT_ID | INT | Unique product identifier |
| CLUSTER_ID | INT | FK to TITLE_CANONICALS |
SHOPS
714,735 rows | Active, quality-filtered online stores with full enrichment. Screened from 1.5M+ stores; 687K excluded as spam, dropshippers, or failing integrity checks.
| Column | Type | Description |
|---|---|---|
| ID | INT | Store identifier |
| NAME / DOMAIN / URL | VARCHAR | Store identification |
| CITY / PROVINCE / COUNTRY | VARCHAR | Geographic location |
| PUBLISHED_PRODUCTS_COUNT | INT | Number of listed products |
| TOP_INTENTS | VARIANT | Array of top search intents for this store (ranked by demand) |
| CATEGORIES / DEMOGRAPHICS / OCCASIONS / SEASONS / USE_CASES | VARIANT | Aggregated enrichment tags |
| CATEGORY_SUMMARY | VARIANT | Category distribution summary |
| SHIPS_TO_COUNTRIES | VARIANT | Detected shipping destinations |
PRICE_HISTORY_EVENTS
95,182,652 rows | Individual price changes flattened from raw price history. Covers February 2024 to present.
| Column | Type | Description |
|---|---|---|
| PRODUCT_ID | INT | FK to PRODUCT_BRIDGE |
| EVENT_DATE | TIMESTAMP | When the price changed |
| PRICE | FLOAT | New price at this event |
| PRICE_CHANGE | FLOAT | Delta from previous price |
Price data note: Filter to PRICE BETWEEN 1 AND 10000 for reliable analysis. The $10K+ bucket contains data quality outliers (likely prices stored in minor currency units).
WEEKLY_DIMENSIONS
147,854,202 rows | Weekly snapshots tracking how the product landscape evolves across 12 dimensions.
| Column | Type | Description |
|---|---|---|
| DIMENSION | VARCHAR | Dimension name (categories, demographics, price_ranges, etc.) |
| VALUE | VARCHAR | Dimension value (e.g. "graphic tees", "women", "$50-100") |
| WEDNESDAY | DATE | Week ending date |
| CLUSTER_COUNT | INT | Clusters with this value that week |
| PRODUCT_COUNT | INT | Products under this value that week |
12 dimensions tracked:
INTENT_LINKS / INTENT_LINK_EDGES
INTENT_LINKS: 1,617,842 rows | Parent nodes in the intent hierarchy.
INTENT_LINK_EDGES: 32,260,516 rows | Parent-child edges. Average fan-out: ~20 intents per parent.
The hierarchy enables roll-up analytics: aggregate long-tail intents under their parent seed to compute total addressable demand for a product space.
Five Enrichment Dimensions
Every enriched cluster carries structured tags accessible via ENRICHMENT_TAGS (VARIANT column on TITLE_CANONICALS):
| Dimension | Clusters Tagged | Unique Values | What It Captures |
|---|---|---|---|
| categories | 3,472,377 | 968,390+ | Fine-grained product categories far beyond any standard taxonomy |
| use_cases | 3,475,262 | 521,414+ | How the product is used (party dressing, vacation wear, home office) |
| occasions | 3,469,541 | 111,229+ | When/why someone buys it (weddings, date nights, back to school) |
| demographics | 3,475,423 | 69,842+ | Who buys it (women, young adults, pet owners, gamers, brides) |
| seasons | 3,466,272 | 2,882 | Seasonal relevance (summer, winter, holidays, year-round) |
Financial Signals
ENRICHMENT_TAGS also contains financial dimension arrays on clusters that have them:
Sample: Market Sizing
How big is the market for a given keyword?
Sample: Demand-Supply Gap Analysis
High-demand keywords with few competing products.
Sample: Competitor Research
Which stores dominate a given category?
Sample: Price Trend Analysis
How have prices moved in a category over 2 years?
Sample: Seasonal Demand Mapping
Which product categories are seasonal?
VARIANT Field Reference
Several columns use Snowflake's VARIANT type. Common access patterns:
Data Quality Notes
Price history outliers: The $10K+ bucket (32M events) contains data quality issues. Always filter PRICE BETWEEN 1 AND 10000 for reliable analysis.
Enrichment coverage: 12.2% of clusters are enriched (3.6M of 29.6M). These cover the highest-priority neighborhoods representing 192M of 372M total products. Enrichment is ongoing and expanding weekly.
Search volume: 88.6% of keywords have SV=0 (expected for long-tail intents). Use PARENT_SV for ranking those keywords. Combined, 82% of all keywords carry some demand signal.
Store quality: 687K stores were excluded as spam, dropshippers, or inactive. The 714K in the SHOPS table passed all quality filters. We screen over 1.5 million stores to produce this filtered set.
Tag normalization: Enrichment tags are free text, not from a controlled vocabulary. Some dimensions have synonyms (e.g., "all year", "year-round", "all seasons"). Group or normalize before aggregating.
Refresh Cadence
The dataset refreshes on a weekly cadence. Dimension snapshots, price tracking, financial metrics, and cluster enrichment are all updated weekly. New products, stores, and keywords enter the graph as they are discovered and processed.