Snowflake Documentation

REACHDOG.TAXONOMY | 15 tables + 2 views | ~1.07B rows | ~18 GB compressed | Updated weekly

This dataset is a fully resolved intent-to-product graph spanning the English-language e-commerce landscape. 372 million products from 714,000 online stores, clustered into 29.6 million product neighborhoods, linked to 22.3 million search intent keywords with demand signals, and enriched across five semantic dimensions. Financial signals (pricing, competition, demand, opportunity) are available for 29.6 million clusters via the FINANCIAL_METRICS table, and per-keyword CPC/competition data is available for 53.6 million keywords via the KEYWORD_FINANCIALS table.

Where traditional product taxonomies organize products into a static tree (e.g., Apparel > Women > Dresses), this dataset builds a many-to-many semantic web that connects any search keyword to the products, stores, demographics, occasions, and price trends that surround it. Any node can reach any other node through structured, queryable paths.

Table Inventory

TableRowsSizeDescription
PRODUCT_BRIDGE361M2.2 GBProduct-to-cluster mapping (every product assigned to a neighborhood)
INTENT_PRODUCT_MATCHES223M3.9 GBKeyword-to-cluster relevance scores (top matches per keyword)
WEEKLY_DIMENSIONS148M1.0 GBWeekly breakdowns across 12 dimensions (5 lifestyle + 7 financial)
PRICE_HISTORY_EVENTS95M1.3 GBIndividual price change events since Feb 2024
KEYWORD_FINANCIALS53.6M0.6 GBPer-keyword CPC, competition, search volume
PRICE_HISTORY45M1.5 GBRaw price history documents per product
MONTHLY_DIMENSIONS35M287 MBMonthly aggregate breakdowns by dimension
INTENT_LINK_EDGES32M85 MBIntent hierarchy parent-child edges
TITLE_CANONICALS29.6M4.7 GBCluster metadata: canonical title, size, enrichment tags
FINANCIAL_METRICS29.6M0.6 GB18 financial signals per cluster
INTENT_PRODUCTS22.3M786 MBMaster keyword table with search volume and categories
INTENT_LINKS1.6M401 MBIntent hierarchy parent nodes (seed/canonical)
SHOPS714K1.0 GBActive stores with enrichment tags and geographic data
WEEKLY_TOTALS59<1 MBPlatform-level weekly totals
MONTHLY_TOTALS14<1 MBPlatform-level monthly totals

Views: CLUSTER_FULL joins TITLE_CANONICALS + FINANCIAL_METRICS into a single enriched cluster row. DIM_FINANCIAL_DIMENSIONS aggregates financial dimension tags for faceted browsing.

Entity Relationships

Core join paths
INTENT_PRODUCTS (22.3M keywords) | |-- INTENT_PRODUCT_MATCHES (223M) --> TITLE_CANONICALS (29.6M clusters) | | | |-- PRODUCT_BRIDGE (361M) --> individual products | |-- PRICE_HISTORY_EVENTS (95M) | |-- FINANCIAL_METRICS (29.6M) | |-- INTENT_LINKS (1.6M) / INTENT_LINK_EDGES (32M) [keyword hierarchy] | v KEYWORD_FINANCIALS (53.6M) <-- per-keyword CPC/competition/SV SHOPS (714K) <-- connected via TOP_INTENTS[] and CATEGORIES[]

Join keys:

INTENT_PRODUCTS.KEYWORD_ID = INTENT_PRODUCT_MATCHES.KEYWORD_ID
INTENT_PRODUCT_MATCHES.CLUSTER_ID = TITLE_CANONICALS.CLUSTER_ID
TITLE_CANONICALS.CLUSTER_ID = PRODUCT_BRIDGE.CLUSTER_ID
PRODUCT_BRIDGE.PRODUCT_ID = PRICE_HISTORY_EVENTS.PRODUCT_ID
TITLE_CANONICALS.CLUSTER_ID = FINANCIAL_METRICS.CLUSTER_ID

INTENT_PRODUCTS

22,347,113 rows | The master list of every search intent keyword in the taxonomy.

ColumnTypeDescription
KEYWORD_IDINTUnique identifier
KWVARCHARThe keyword text
TYPEINT1=seed (1.0M), 2=canonical (2.1M), 3=intent (19.3M)
SVINTMeasured search volume (Google/Amazon)
PARENT_SVINTInherited search volume from parent keyword
CATSVARIANTArray of category labels

Search volume note: SV is measured directly from Google Ads or Amazon. PARENT_SV is inherited from the parent seed/canonical keyword. 82% of all keywords carry one or both signals. Never sum or average SV and PARENT_SV together.

INTENT_PRODUCT_MATCHES

223,471,130 rows | Links every keyword to its closest product neighborhoods with a relevance score.

ColumnTypeDescription
KEYWORD_IDINTFK to INTENT_PRODUCTS
CLUSTER_IDINTFK to TITLE_CANONICALS
SCOREFLOATSemantic relevance score (higher = stronger match)
TITLEVARCHARThe matched cluster's canonical title
CLUSTER_SIZEINTNumber of products in the cluster

TITLE_CANONICALS

29,637,507 rows | Each row represents a product neighborhood with its canonical title and enrichment tags.

ColumnTypeDescription
CLUSTER_IDINTUnique cluster identifier
SIZEINTNumber of products in this cluster
CANONICALVARCHARRepresentative product title (max 125 chars)
TITLESVARIANTArray of sample titles from the cluster
ENRICHMENT_TAGSVARIANTEnrichment across 5+ dimensions (see below)
HAS_PRICE_HISTORYBOOLEANWhether products in this cluster have price tracking

3.6 million clusters enriched (12.2% of all clusters). Enriched clusters cover the highest-priority neighborhoods and represent 192 million of the 372 million total products. Enrichment is ongoing.

FINANCIAL_METRICS

29,637,507 rows | One row per cluster. Joins 1:1 with TITLE_CANONICALS on CLUSTER_ID. Use the CLUSTER_FULL view for a single enriched row.

ColumnTypeCoverageDescription
CLUSTER_IDINT100%FK to TITLE_CANONICALS
PRICE_MEDIANFLOAT99%Median price across products in cluster
PRICE_MINFLOAT99%Minimum observed price
PRICE_MAXFLOAT99%Maximum observed price
SELLER_COUNTINT100%Distinct sellers in this cluster
CPCFLOAT17%Cost-per-click estimate (USD)
COMPETITIONFLOAT17%Google Ads competition score (0-1)
DEMAND_SCOREFLOAT21%Aggregated demand signal from matched keywords
OPPORTUNITY_SCOREFLOAT21%Composite: demand relative to supply and competition
OPPORTUNITY_TIERSVARIANT23%High / Medium / Low tier labels
INTENT_VOLUME_SCORENUMBER23%Composite intent volume score
OPPORTUNITY_GAP_SCOREFLOAT23%Demand-supply gap score
PRICE_MEDIAN_13WFLOAT4.7%13-week rolling median price
PRICE_MOMENTUMFLOAT4.7%Price momentum signal
DEMAND_VELOCITYFLOAT0.9%Rate of demand change

KEYWORD_FINANCIALS

53,612,008 rows | Per-keyword financial data. Includes original 22.3M intent keywords plus 30.5M voice-search questions.

ColumnTypeDescription
KIDINTUnique keyword ID
KWVARCHARKeyword or question text
TYPEINT1=seed, 2=canonical, 3=intent/question
CIDINTCluster ID (FK to TITLE_CANONICALS)
COMPETITION_INDEXFLOATGoogle Ads competition (0-100)
COMPETITION_LEVELVARCHARLow / Medium / High
CPCFLOATCost-per-click (USD)
CPC_HIGHFLOATHigh-range CPC estimate
CPC_LOWFLOATLow-range CPC estimate
SVINTMonthly search volume

PRODUCT_BRIDGE

361,388,917 rows | Maps every individual product to its cluster neighborhood.

ColumnTypeDescription
PRODUCT_IDINTUnique product identifier
CLUSTER_IDINTFK to TITLE_CANONICALS

SHOPS

714,735 rows | Active, quality-filtered online stores with full enrichment. Screened from 1.5M+ stores; 687K excluded as spam, dropshippers, or failing integrity checks.

ColumnTypeDescription
IDINTStore identifier
NAME / DOMAIN / URLVARCHARStore identification
CITY / PROVINCE / COUNTRYVARCHARGeographic location
PUBLISHED_PRODUCTS_COUNTINTNumber of listed products
TOP_INTENTSVARIANTArray of top search intents for this store (ranked by demand)
CATEGORIES / DEMOGRAPHICS / OCCASIONS / SEASONS / USE_CASESVARIANTAggregated enrichment tags
CATEGORY_SUMMARYVARIANTCategory distribution summary
SHIPS_TO_COUNTRIESVARIANTDetected shipping destinations

PRICE_HISTORY_EVENTS

95,182,652 rows | Individual price changes flattened from raw price history. Covers February 2024 to present.

ColumnTypeDescription
PRODUCT_IDINTFK to PRODUCT_BRIDGE
EVENT_DATETIMESTAMPWhen the price changed
PRICEFLOATNew price at this event
PRICE_CHANGEFLOATDelta from previous price

Price data note: Filter to PRICE BETWEEN 1 AND 10000 for reliable analysis. The $10K+ bucket contains data quality outliers (likely prices stored in minor currency units).

WEEKLY_DIMENSIONS

147,854,202 rows | Weekly snapshots tracking how the product landscape evolves across 12 dimensions.

ColumnTypeDescription
DIMENSIONVARCHARDimension name (categories, demographics, price_ranges, etc.)
VALUEVARCHARDimension value (e.g. "graphic tees", "women", "$50-100")
WEDNESDAYDATEWeek ending date
CLUSTER_COUNTINTClusters with this value that week
PRODUCT_COUNTINTProducts under this value that week

12 dimensions tracked:

categoriesuse_casesoccasionsdemographicsseasonsprice_rangescpc_tierscompetition_levelsrevenue_tiersprice_trendsmarket_positionsopportunity_tiers

INTENT_LINKS: 1,617,842 rows | Parent nodes in the intent hierarchy.
INTENT_LINK_EDGES: 32,260,516 rows | Parent-child edges. Average fan-out: ~20 intents per parent.

The hierarchy enables roll-up analytics: aggregate long-tail intents under their parent seed to compute total addressable demand for a product space.

Five Enrichment Dimensions

Every enriched cluster carries structured tags accessible via ENRICHMENT_TAGS (VARIANT column on TITLE_CANONICALS):

DimensionClusters TaggedUnique ValuesWhat It Captures
categories3,472,377968,390+Fine-grained product categories far beyond any standard taxonomy
use_cases3,475,262521,414+How the product is used (party dressing, vacation wear, home office)
occasions3,469,541111,229+When/why someone buys it (weddings, date nights, back to school)
demographics3,475,42369,842+Who buys it (women, young adults, pet owners, gamers, brides)
seasons3,466,2722,882Seasonal relevance (summer, winter, holidays, year-round)

Financial Signals

ENRICHMENT_TAGS also contains financial dimension arrays on clusters that have them:

price_ranges: Under $10, $10-25, $25-50, $50-100, $100-250, $250-500, $500+
cpc_tiers: No CPC, $0.01-0.50, $0.50-1, $1-2, $2-5, $5+
competition_levels: None, Low, Medium, High
market_positions: Budget, Mid-Range, Premium, Luxury

Sample: Market Sizing

How big is the market for a given keyword?

SQL
SELECT ip.KW, ip.SV, SUM(ipm.CLUSTER_SIZE) AS total_products, COUNT(DISTINCT ipm.CLUSTER_ID) AS competing_clusters, AVG(ipm.SCORE) AS avg_relevance FROM INTENT_PRODUCTS ip JOIN INTENT_PRODUCT_MATCHES ipm ON ip.KEYWORD_ID = ipm.KEYWORD_ID WHERE ip.KW = 'wireless earbuds' GROUP BY 1, 2;

Sample: Demand-Supply Gap Analysis

High-demand keywords with few competing products.

SQL
SELECT ip.KW, ip.SV, COUNT(DISTINCT ipm.CLUSTER_ID) AS supply_clusters, SUM(ipm.CLUSTER_SIZE) AS supply_products, ip.SV / NULLIF(SUM(ipm.CLUSTER_SIZE), 0) AS demand_per_product FROM INTENT_PRODUCTS ip JOIN INTENT_PRODUCT_MATCHES ipm ON ip.KEYWORD_ID = ipm.KEYWORD_ID WHERE ip.SV > 1000 GROUP BY 1, 2 HAVING supply_products < 100 ORDER BY demand_per_product DESC LIMIT 50;

Sample: Competitor Research

Which stores dominate a given category?

SQL
SELECT s.DOMAIN, s.COUNTRY, s.PUBLISHED_PRODUCTS_COUNT, s.ENRICHED_CLUSTERS, ARRAY_SIZE(s.TOP_INTENTS) AS intent_breadth FROM SHOPS s, LATERAL FLATTEN(INPUT => s.CATEGORIES) c WHERE c.VALUE::STRING = 'pet supplies' ORDER BY s.ENRICHED_CLUSTERS DESC LIMIT 25;

How have prices moved in a category over 2 years?

SQL
SELECT DATE_TRUNC('MONTH', phe.EVENT_DATE) AS month, AVG(phe.PRICE) AS avg_price, MEDIAN(phe.PRICE) AS median_price, COUNT(DISTINCT phe.PRODUCT_ID) AS products_tracked FROM TITLE_CANONICALS tc JOIN PRODUCT_BRIDGE pb ON tc.CLUSTER_ID = pb.CLUSTER_ID JOIN PRICE_HISTORY_EVENTS phe ON pb.PRODUCT_ID = phe.PRODUCT_ID WHERE tc.ENRICHMENT_TAGS:categories[0]::STRING = 'wireless earbuds' AND phe.PRICE BETWEEN 1 AND 10000 GROUP BY 1 ORDER BY 1;

Sample: Seasonal Demand Mapping

Which product categories are seasonal?

SQL
SELECT cat.VALUE::STRING AS category, sea.VALUE::STRING AS season, COUNT(*) AS cluster_count FROM TITLE_CANONICALS, LATERAL FLATTEN(INPUT => ENRICHMENT_TAGS:categories) cat, LATERAL FLATTEN(INPUT => ENRICHMENT_TAGS:seasons) sea WHERE ENRICHMENT_TAGS IS NOT NULL GROUP BY 1, 2 HAVING cluster_count > 50 ORDER BY 1, 3 DESC;

VARIANT Field Reference

Several columns use Snowflake's VARIANT type. Common access patterns:

VARIANT access patterns
-- Array access ENRICHMENT_TAGS:categories[0]::STRING -- first category ARRAY_SIZE(ENRICHMENT_TAGS:categories) -- category count TOP_INTENTS[0]:kw::STRING -- first intent keyword TOP_INTENTS[0]:sv::INT -- first intent's search volume -- Flatten for aggregation LATERAL FLATTEN(INPUT => ENRICHMENT_TAGS:categories) LATERAL FLATTEN(INPUT => TOP_INTENTS) LATERAL FLATTEN(INPUT => SHIPS_TO_COUNTRIES) -- Filtering WHERE ARRAY_CONTAINS('summer'::VARIANT, ENRICHMENT_TAGS:seasons) WHERE ENRICHMENT_TAGS:categories[0]::STRING = 'graphic tees'

Data Quality Notes

Price history outliers: The $10K+ bucket (32M events) contains data quality issues. Always filter PRICE BETWEEN 1 AND 10000 for reliable analysis.

Enrichment coverage: 12.2% of clusters are enriched (3.6M of 29.6M). These cover the highest-priority neighborhoods representing 192M of 372M total products. Enrichment is ongoing and expanding weekly.

Search volume: 88.6% of keywords have SV=0 (expected for long-tail intents). Use PARENT_SV for ranking those keywords. Combined, 82% of all keywords carry some demand signal.

Store quality: 687K stores were excluded as spam, dropshippers, or inactive. The 714K in the SHOPS table passed all quality filters. We screen over 1.5 million stores to produce this filtered set.

Tag normalization: Enrichment tags are free text, not from a controlled vocabulary. Some dimensions have synonyms (e.g., "all year", "year-round", "all seasons"). Group or normalize before aggregating.

Refresh Cadence

The dataset refreshes on a weekly cadence. Dimension snapshots, price tracking, financial metrics, and cluster enrichment are all updated weekly. New products, stores, and keywords enter the graph as they are discovered and processed.