{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "67dd7ece",
   "metadata": {},
   "source": [
    "# SUBSIDE Science Backbone — Setup & Preprocessing\n",
    "\n",
    "**Companion to:** `SUBSIDE_Science_Backbone.ipynb` (the main analysis notebook).\n",
    "\n",
    "This notebook produces a **referenced, reusable science backbone** — a JSON file\n",
    "the main notebook loads as **Tier 0**, ahead of the ETO/sbp/inlined waterfall it\n",
    "already supports. Build it once, commit it alongside the corpus, and every\n",
    "downstream analysis is anchored on the same, citable reference frame.\n",
    "\n",
    "### Why a separate setup step?\n",
    "\n",
    "Two ideas from the scientometrics literature motivate the split:\n",
    "\n",
    "1. **A science basemap is a reference system, not a result.** Börner, Klavans,\n",
    "   Boyack and colleagues frame the UCSD Map of Science as the cartographic\n",
    "   analogue of Mercator's world map — a stable scaffold on which heterogeneous\n",
    "   *data overlays* are placed [1, 2]. Treating the backbone as a setup artifact\n",
    "   keeps it out of the experiment loop.\n",
    "2. **A consensus map outperforms any single map.** Klavans & Boyack (2009)\n",
    "   showed that pooling 20 distinct global maps of science yields a more stable\n",
    "   ordering of disciplines than any one source [3]. The same logic applies\n",
    "   locally: combining the **paper-level, current ETO Map of Science** [4] with\n",
    "   the **journal-level UCSD canonical 13-discipline scaffold** [1] produces a\n",
    "   backbone more robust than either alone.\n",
    "\n",
    "### Three layers, decreasing verification → increasing locality\n",
    "\n",
    "| Layer | Source | What it is | When to use |\n",
    "|---|---|---|---|\n",
    "| **A** | [ETO Map of Science](https://sciencemap.eto.tech/) cluster CSV export | Paper-level Leiden clusters over the Merged Academic Corpus (~92,000 clusters), Klavans/Boyack lineage methodology updated to OpenAlex/Semantic Scholar/Web of Science [4] | Most current, paper-level granularity. Requires one manual export step. |\n",
    "| **B** | UCSD Map of Science 13-discipline / 554-subdiscipline classification [1] | Journal-level, published 2012 (10-year update covering 2001–2010, ~25,000 journals), CC BY-NC-SA 3.0 | Stable, citable canonical scaffold. Use when ETO is unavailable or when you want a cross-corpus-comparable reference frame. |\n",
    "| **C** | SUBSIDE-tuned inline fallback (lives in `SUBSIDE_Science_Backbone.ipynb` §5) | Curated for subsidence vocabulary | Local, fast-start, intentionally narrow — for early iteration only. |\n",
    "\n",
    "The output of this notebook is a single JSON file (`science_backbone.json`) that\n",
    "the main notebook reads via a Tier-0 hook. **You can re-run the main notebook\n",
    "many times without re-running this one.**\n",
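     "\n",
     "For orientation, here is an illustrative sketch of the backbone's per-domain\n",
     "shape. Domain names and values below are made-up examples; the keys match what\n",
     "the Layer-A builder (§3.2) emits:\n",
     "\n",
     "```python\n",
     "import json\n",
     "\n",
     "# Toy example of the science_backbone.json structure (values invented).\n",
     "backbone = {\n",
     "    'Earth Sciences': {\n",
     "        'subdisciplines': ['Land subsidence monitoring', 'Aquifer compaction'],\n",
     "        'terms': ['aquifer', 'compaction', 'insar'],\n",
     "        '_source': 'ETO Map of Science (toy example)',\n",
     "        '_n_clusters': 2,\n",
     "    },\n",
     "}\n",
     "print(json.dumps(backbone, indent=2))\n",
     "```\n",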
    "\n",
    "### References\n",
    "\n",
    "1. Börner K, Klavans R, Patek M, Zoss AM, Biberstine JR, Light RP, Larivière V, Boyack KW (2012). *Design and Update of a Classification System: The UCSD Map of Science.* PLOS ONE 7(7): e39464. https://doi.org/10.1371/journal.pone.0039464\n",
    "2. Shiffrin RM, Börner K (2004). *Mapping Knowledge Domains.* PNAS 101 (Suppl. 1): 5183–5185. https://doi.org/10.1073/pnas.0307852100\n",
    "3. Klavans R, Boyack KW (2009). *Toward a Consensus Map of Science.* JASIST 60(3): 455–476. https://doi.org/10.1002/asi.20991\n",
    "4. Emerging Technology Observatory, Center for Security and Emerging Technology, Georgetown University. *ETO Map of Science.* https://sciencemap.eto.tech/ — methodology at https://eto.tech/dataset-docs/mac-clusters/\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb29f2ea",
   "metadata": {},
   "source": [
    "## 1. Setup\n",
    "\n",
    "Paths, imports, and the BibTeX corpus loader from the main notebook.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8d1b939e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BibTeX source : /work/01813/sawp33/MySUBSIDE/Global_SUBSIDE_2024.bib  (exists = True)\n",
      "Output dir    : /scratch/01813/sawp33/tapis/b673a5f6-e828-4ff9-add8-fd125f967b4a-007/work/dso_cookbook_fixes/ScienceBackboneResults\n",
      "Backbone JSON : /scratch/01813/sawp33/tapis/b673a5f6-e828-4ff9-add8-fd125f967b4a-007/work/dso_cookbook_fixes/ScienceBackboneResults/science_backbone.json\n",
      "ETO CSV path  : /work/01813/sawp33/ls6/dso_cookbook_fixes/eto-map-of-scienceWaterResources.csv\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "import os, re, json\n",
    "from datetime import datetime\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# ── Paths (mirror SUBSIDE_Science_Backbone.ipynb §1.1) ───────────────\n",
    "CANDIDATE_BIB_PATHS = [\n",
    "    Path(\"/work/01813/sawp33/MySUBSIDE/Global_SUBSIDE_2024.bib\"),\n",
    "    Path(os.getcwd()) / \"Global_SUBSIDE_2024.bib\",\n",
    "    Path(os.getcwd()) / \"data\" / \"Global_SUBSIDE_2024.bib\",\n",
    "]\n",
    "BIB_PATH = next((p for p in CANDIDATE_BIB_PATHS if p.exists()),\n",
    "                CANDIDATE_BIB_PATHS[0])\n",
    "\n",
    "OUTPUT_DIR = Path(os.getcwd()) / \"ScienceBackboneResults\"\n",
    "OUTPUT_DIR.mkdir(exist_ok=True)\n",
    "\n",
    "# Where the main notebook will look for the produced backbone JSON\n",
    "BACKBONE_JSON_PATH = OUTPUT_DIR / \"science_backbone.json\"\n",
    "\n",
    "# If you've already exported an ETO Map of Science CSV, point at it here.\n",
    "# Leave None to skip Layer A entirely.\n",
    "import os\n",
    "\n",
    "ETO_CSV_PATH = Path(os.environ[\"WORK\"]) / \"dso_cookbook_fixes\" / \"eto-map-of-scienceWaterResources.csv\"\n",
    "\n",
    "print(f\"BibTeX source : {BIB_PATH}  (exists = {BIB_PATH.exists()})\")\n",
    "print(f\"Output dir    : {OUTPUT_DIR}\")\n",
    "print(f\"Backbone JSON : {BACKBONE_JSON_PATH}\")\n",
    "print(f\"ETO CSV path  : {ETO_CSV_PATH}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e05fbeeb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✓ Imports complete\n"
     ]
    }
   ],
   "source": [
    "# Dependency check — same set as the main notebook\n",
    "import sys, subprocess, importlib\n",
    "\n",
    "def _ensure(pkg, import_name=None):\n",
    "    try:\n",
    "        importlib.import_module(import_name or pkg)\n",
    "    except ImportError:\n",
    "        subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", pkg])\n",
    "\n",
    "for _p, _m in [(\"bibtexparser\", \"bibtexparser\"),\n",
    "               (\"scikit-learn\", \"sklearn\"),\n",
    "               (\"plotly\", \"plotly\")]:\n",
    "    _ensure(_p, _m)\n",
    "\n",
    "import bibtexparser\n",
    "from bibtexparser.bparser import BibTexParser\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "print(\"✓ Imports complete\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3867c6a7",
   "metadata": {},
   "source": [
    "## 2. Summarize your corpus for ETO query construction\n",
    "\n",
    "Before going to the ETO Map UI, we need a short list of distinctive search\n",
    "terms that describe the corpus well. These are the words you'll paste into\n",
    "the Map's search bar so that the cluster export covers the relevant slice of\n",
    "science instead of all 92,000 clusters.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "e392c416",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✓ Loaded 8,409 entries from Global_SUBSIDE_2024.bib\n"
     ]
    }
   ],
   "source": [
    "def load_bib_quick(path):\n",
    "    '''Minimal BibTeX loader — just enough to harvest keywords/abstracts.'''\n",
    "    parser = BibTexParser(common_strings=True)\n",
    "    parser.ignore_nonstandard_types = False\n",
    "    parser.homogenize_fields = True\n",
    "    with open(path, \"r\", encoding=\"utf-8\", errors=\"replace\") as fh:\n",
    "        return bibtexparser.load(fh, parser=parser).entries\n",
    "\n",
    "\n",
    "def _clean(text):\n",
    "    if not text:\n",
    "        return \"\"\n",
    "    t = re.sub(r\"[{}\\\\]\", \" \", str(text))\n",
    "    return re.sub(r\"\\s+\", \" \", t).strip()\n",
    "\n",
    "\n",
    "if BIB_PATH.exists():\n",
    "    raw = load_bib_quick(BIB_PATH)\n",
    "    rows = []\n",
    "    for e in raw:\n",
    "        rows.append({\n",
    "            \"title\":    _clean(e.get(\"title\")),\n",
    "            \"abstract\": _clean(e.get(\"abstract\")),\n",
    "            \"keywords\": _clean(e.get(\"keywords\")),\n",
    "        })\n",
    "    df_quick = pd.DataFrame(rows)\n",
    "    df_quick[\"text_content\"] = (df_quick[\"title\"].fillna(\"\") + \" . \" +\n",
    "                                df_quick[\"abstract\"].fillna(\"\") + \" . \" +\n",
    "                                df_quick[\"keywords\"].fillna(\"\"))\n",
    "    df_quick = df_quick[df_quick[\"text_content\"].str.strip().str.len() > 0]\n",
    "    print(f\"✓ Loaded {len(df_quick):,} entries from {BIB_PATH.name}\")\n",
    "else:\n",
    "    df_quick = pd.DataFrame(columns=[\"title\",\"abstract\",\"keywords\",\"text_content\"])\n",
    "    print(f\"  ⚠ {BIB_PATH} not found — Section 4 below will still work but with empty input\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a1a917f",
   "metadata": {},
   "source": [
    "### 2.1 Top distinctive terms\n",
    "\n",
    "A short, ranked list of bigrams + unigrams that distinguish this corpus. Use\n",
    "the top 10–20 as your ETO search seeds.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "501e354a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top 25 distinctive terms (paste into the ETO Map search bar):\n",
      "\n",
      "   1. groundwater\n",
      "   2. land\n",
      "   3. water\n",
      "   4. land subsidence\n",
      "   5. level\n",
      "   6. mining\n",
      "   7. surface\n",
      "   8. deformation\n",
      "   9. area\n",
      "  10. sea\n",
      "  11. basin\n",
      "  12. data\n",
      "  13. model\n",
      "  14. study\n",
      "  15. ground\n",
      "  16. coastal\n",
      "  17. areas\n",
      "  18. soil\n",
      "  19. results\n",
      "  20. nan\n",
      "  21. sea level\n",
      "  22. using\n",
      "  23. high\n",
      "  24. coal\n",
      "  25. analysis\n"
     ]
    }
   ],
   "source": [
    "ETO_SEARCH_SEEDS_N = 25\n",
    "\n",
    "if len(df_quick):\n",
    "    vec = TfidfVectorizer(\n",
    "        max_features=2000,\n",
    "        ngram_range=(1, 2),\n",
    "        min_df=3,\n",
    "        max_df=0.7,\n",
    "        stop_words=\"english\",\n",
    "    )\n",
    "    X = vec.fit_transform(df_quick[\"text_content\"].tolist())\n",
    "    means = np.asarray(X.mean(axis=0)).ravel()\n",
    "    vocab = np.array(vec.get_feature_names_out())\n",
    "    order = means.argsort()[::-1]\n",
    "    # Filter out pure digits / 1-char tokens\n",
    "    seeds = [vocab[i] for i in order\n",
    "             if not vocab[i].isdigit() and len(vocab[i]) > 2][:ETO_SEARCH_SEEDS_N]\n",
    "    print(f\"Top {len(seeds)} distinctive terms (paste into the ETO Map search bar):\\n\")\n",
    "    for i, s in enumerate(seeds, 1):\n",
    "        print(f\"  {i:2d}. {s}\")\n",
    "else:\n",
    "    seeds = []\n",
    "    print(\"(no entries — skipping)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0dc8492",
   "metadata": {},
   "source": [
    "## 3. Export the ETO Map of Science cluster slice (Layer A)\n",
    "\n",
    "The ETO Map's cluster metadata isn't a public bulk download — the underlying\n",
    "Research Cluster Dataset has commercial-license restrictions from Clarivate\n",
    "and others. But the Map UI lets you **build a query, switch to list view,\n",
    "and download the matched clusters as CSV** for any analytic environment. That\n",
    "CSV is what we ingest here.\n",
    "\n",
    "> **Per ETO's terms of use:** when you cite work derived from the Map of Science,\n",
    "> attribute it as `\"ETO Map of Science\"` and link to `https://sciencemap.eto.tech/`.\n",
    "\n",
    "### Click-by-click steps\n",
    "\n",
    "1. **Open** [https://sciencemap.eto.tech/](https://sciencemap.eto.tech/) in a browser.\n",
    "2. **Search** using a few of the distinctive terms you saw in §2.1\n",
    "   (e.g. *land subsidence*, *insar*, *groundwater*). The search box is at the top\n",
    "   of the left-hand filter panel.\n",
    "3. **Refine with filters** if you want — country, growth rating, subject\n",
    "   filters in the left sidebar. For a literature-scoping run you usually want\n",
    "   to leave growth/citation filters wide open.\n",
    "4. **Switch to list view** using the toggle near the top of the results.\n",
    "5. **Customize columns**. The columns you'll want for the SUBSIDE backbone are:\n",
    "   `cluster_id`, `cluster_title`, `cluster_summary` (or `summary`),\n",
    "   `disciplines` (3 top-scoring), `fields` (3 top-scoring), `subfields`,\n",
    "   `topics`, `key_concepts`, `map_x` / `map_y` if available, `size`.\n",
    "6. **Download CSV** — the export button is in the list-view header. Save the\n",
    "   file somewhere this notebook can read (e.g. next to your `.bib`).\n",
    "7. **Set `ETO_CSV_PATH`** in §1 above to that file's path, then re-run.\n",
    "\n",
    "Layers B (UCSD canonical) and C (SUBSIDE inline) continue to work even if you\n",
    "skip the ETO export — Tier 0 will pick up whichever layer(s) you've populated.\n",
    "\n",
    "The ETO export comes from a Leiden-clustered network that combines between-article\n",
    "**citations** with text-embedding similarity (multilingual SentenceTransformer),\n",
    "a contemporary descendant of the bibliographic coupling and co-citation\n",
    "techniques pioneered by Boyack, Klavans, and Börner.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ce82cdc",
   "metadata": {},
   "source": [
    "### 3.1 Ingest the ETO cluster CSV\n",
    "\n",
    "Tolerant loader: the column names vary somewhat across Map-of-Science updates,\n",
    "so we look for the *shape* of each field rather than relying on exact names.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "047c828c-9e4f-4d2c-9120-e00a32c8e7ea",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ✗ Failed to load ETO CSV: NameError: name 'resolve_columns' is not defined\n",
      "(Layer A: empty — no clusters mapped)\n"
     ]
    }
   ],
   "source": [
    "#PATCH FOR 3.1 and 3.2 cells below\n",
    "\"\"\"\n",
    "Drop-in replacement for §3.1 and §3.2 of SUBSIDE_ScienceBackbone_Setup.ipynb.\n",
    "\n",
    "Handles the May-2026 ETO Map of Science export format, where the previous\n",
    "hierarchical disciplines/fields/topics columns have been consolidated into\n",
    "a single 'Most common research field' column with 12 top-level values.\n",
    "\n",
    "Three configurable knobs near the top:\n",
    "  ETO_TITLE_FILTER       — list of patterns; keep clusters mentioning any of them\n",
    "  ETO_FIELD_WHITELIST    — restrict to a subset of ETO's 12 fields (None = all)\n",
    "  N_SUBS_PER_DOMAIN      — cap on subdisciplines per UCSD-mapped domain\n",
    "\"\"\"\n",
    "\n",
    "from collections import defaultdict, Counter\n",
    "\n",
    "# ─────────────────────────────────────────────────────────────────────\n",
    "# §3.1 — Ingest the ETO cluster CSV (new May 2026 single-field format)\n",
    "# ─────────────────────────────────────────────────────────────────────\n",
    "\n",
    "# After _norm() strips whitespace/underscore/dash, 'Most common research field'\n",
    "# normalizes to 'mostcommonresearchfield' — we recognize that and a few others.\n",
    "COL_CANDIDATES = {\n",
    "    \"cluster_id\": [\"clusterid\", \"id\", \"cluster\"],\n",
    "    \"title\":      [\"clustertitle\", \"title\", \"name\"],\n",
    "    \"summary\":    [\"clustersummary\", \"summary\", \"description\"],\n",
    "    \"field\":      [\"mostcommonresearchfield\", \"mostcommonfield\", \"researchfield\",\n",
    "                   \"field\", \"disciplines\", \"discipline\"],\n",
    "    \"size\":       [\"clustersize\", \"size\", \"narticles\", \"articlecount\"],\n",
    "    \"growth\":     [\"growthrating\", \"growth\"],\n",
    "    \"citation\":   [\"citationrating\", \"citation\"],\n",
    "}\n",
    "\n",
    "# ETO's 12 top-level fields → UCSD's 13 canonical disciplines.\n",
    "# A few collapses are unavoidable (medicine spans 4 UCSD disciplines, materials\n",
    "# science could go to either Chemistry or CMCE) — we pick the closest single\n",
    "# UCSD bucket so the Layer A → Layer B merge in §5 is unambiguous.\n",
    "ETO_FIELD_TO_UCSD = {\n",
    "    \"earth science\":      \"Earth Sciences\",\n",
    "    \"biology\":            \"Biology\",\n",
    "    \"chemistry\":          \"Chemistry\",\n",
    "    \"materials science\":  \"Chemistry\",\n",
    "    \"computer science\":   \"Electrical Engineering & Computer Science\",\n",
    "    \"engineering\":        \"Chemical, Mechanical, & Civil Engineering\",\n",
    "    \"mathematics\":        \"Math & Physics\",\n",
    "    \"physics\":            \"Math & Physics\",\n",
    "    \"medicine\":           \"Medical Specialties\",\n",
    "    \"humanities\":         \"Humanities\",\n",
    "    \"social science\":     \"Social Sciences\",\n",
    "    \"business\":           \"Social Sciences\",\n",
    "}\n",
    "\n",
    "# SUBSIDE-relevant cluster filter. The full export is 91,585 clusters spanning\n",
    "# all of science; we keep only those whose title or summary mentions at least\n",
    "# one of these terms. Edit freely. Set ETO_TITLE_FILTER = None to keep all.\n",
    "ETO_TITLE_FILTER = [\n",
    "    \"subsidence\", \"compaction\", \"consolidation\", \"uplift\", \"rebound\", \"heave\",\n",
    "    \"groundwater\", \"aquifer\", \"hydrolog\", \"watershed\", \"phreatic\",\n",
    "    \"oil\", \"petroleum\", \"reservoir\", \"co2 storage\", \"sequestration\",\n",
    "    \"mining\", \"longwall\", \"coal mine\",\n",
    "    \"tectonic\", \"fault\", \"seismic\", \"geodet\",\n",
    "    \"sediment\", \"delta\", \"alluvial\",\n",
    "    \"insar\", \"interferometric\", \"gnss\", \"lidar\", \"remote sens\",\n",
    "    \"coastal\", \"shoreline\", \"tidal\", \"estuar\", \"sea level\", \"sea-level\", \"marsh\",\n",
    "    \"flood\", \"inundation\", \"storm surge\",\n",
    "    \"modflow\", \"groundwater model\",\n",
    "    \"land use\", \"land subsidence\", \"urban infrastructure\",\n",
    "    \"water resource\", \"water management\",\n",
    "]\n",
    "# Optional second filter: restrict to a subset of ETO's 12 fields.\n",
    "# Helpful if you want only earth/engineering/materials clusters regardless of title.\n",
    "ETO_FIELD_WHITELIST = None  # e.g. [\"earth science\", \"engineering\", \"materials science\"]\n",
    "\n",
    "\n",
    "eto_clusters = None\n",
    "if ETO_CSV_PATH and Path(ETO_CSV_PATH).exists():\n",
    "    try:\n",
    "        eto_raw = pd.read_csv(ETO_CSV_PATH)\n",
    "        cols = resolve_columns(eto_raw)\n",
    "        print(f\"  Loaded {len(eto_raw):,} rows from {ETO_CSV_PATH}\")\n",
    "        print(f\"  Resolved columns: {cols}\")\n",
    "\n",
    "        n_orig = len(eto_raw)\n",
    "\n",
    "        # 1. Optional field whitelist\n",
    "        if ETO_FIELD_WHITELIST and \"field\" in cols:\n",
    "            wl = {f.lower() for f in ETO_FIELD_WHITELIST}\n",
    "            mask = eto_raw[cols[\"field\"]].astype(str).str.lower().isin(wl)\n",
    "            eto_raw = eto_raw[mask]\n",
    "            print(f\"  Field whitelist:  {n_orig:,} → {len(eto_raw):,}\")\n",
    "\n",
    "        # 2. Optional title/summary filter\n",
    "        if ETO_TITLE_FILTER and \"title\" in cols:\n",
    "            pattern = \"|\".join(re.escape(t) for t in ETO_TITLE_FILTER)\n",
    "            search_text = eto_raw[cols[\"title\"]].fillna(\"\").astype(str)\n",
    "            if \"summary\" in cols:\n",
    "                search_text = (search_text + \" \" +\n",
    "                               eto_raw[cols[\"summary\"]].fillna(\"\").astype(str))\n",
    "            mask = search_text.str.lower().str.contains(pattern, regex=True, na=False)\n",
    "            n_pre = len(eto_raw)\n",
    "            eto_raw = eto_raw[mask]\n",
    "            print(f\"  Title/summary filter: {n_pre:,} → {len(eto_raw):,}\")\n",
    "\n",
    "        # 3. Normalize to dicts\n",
    "        recs = []\n",
    "        for idx, row in eto_raw.iterrows():\n",
    "            rec = {\"raw_row_index\": int(idx)}\n",
    "            for canon, actual in cols.items():\n",
    "                val = row[actual]\n",
    "                rec[canon] = val if pd.notna(val) else None\n",
    "            recs.append(rec)\n",
    "        eto_clusters = recs\n",
    "\n",
    "        print(f\"  ✓ Normalized {len(eto_clusters):,} ETO clusters after filtering\")\n",
    "        if eto_clusters:\n",
    "            fc = Counter(c.get(\"field\") for c in eto_clusters).most_common()\n",
    "            print(f\"\\n  Field distribution after filter (→ UCSD mapping):\")\n",
    "            for f, n in fc:\n",
    "                ucsd = ETO_FIELD_TO_UCSD.get(str(f).lower(), \"(unmapped)\")\n",
    "                print(f\"    {str(f):24s} {n:>5,}  → {ucsd}\")\n",
    "    except Exception as e:\n",
    "        print(f\"  ✗ Failed to load ETO CSV: {type(e).__name__}: {e}\")\n",
    "elif ETO_CSV_PATH:\n",
    "    print(f\"  ⚠ ETO_CSV_PATH set but file not found: {ETO_CSV_PATH}\")\n",
    "else:\n",
    "    print(\"  (ETO_CSV_PATH not set — skipping Layer A; using Layer B alone)\")\n",
    "\n",
    "\n",
    "# ─────────────────────────────────────────────────────────────────────\n",
    "# §3.2 — Build Layer A from the normalized clusters\n",
    "# ─────────────────────────────────────────────────────────────────────\n",
    "\n",
    "N_SUBS_PER_DOMAIN = 25   # cap; the largest clusters per domain become subs\n",
    "N_TERMS_PER_DOMAIN = 80\n",
    "\n",
    "\n",
    "def build_layer_A_backbone(clusters):\n",
    "    \"\"\"Build Layer A from ETO clusters in the May-2026 single-field format.\n",
    "\n",
    "    Strategy:\n",
    "      - Group clusters by ETO field, then map field → UCSD canonical discipline.\n",
    "      - Within each domain, sort clusters by article count (size) descending.\n",
    "      - Take the top-N cluster titles as the domain's subdisciplines.\n",
    "      - Seed terms from cluster title + summary tokens.\n",
    "    \"\"\"\n",
    "    if not clusters:\n",
    "        return None\n",
    "\n",
    "    LABEL_STOP = {\"the\",\"and\",\"of\",\"in\",\"on\",\"for\",\"to\",\"with\",\"by\",\"from\",\"an\",\"a\",\n",
    "                  \"or\",\"general\",\"other\",\"sciences\",\"science\",\"research\",\"studies\",\n",
    "                  \"based\",\"using\",\"via\",\"application\",\"applications\",\"analysis\"}\n",
    "\n",
    "    by_domain = defaultdict(list)\n",
    "    for c in clusters:\n",
    "        field = str(c.get(\"field\") or \"\").lower().strip()\n",
    "        if not field:\n",
    "            continue\n",
    "        ucsd = ETO_FIELD_TO_UCSD.get(field)\n",
    "        if not ucsd:\n",
    "            # Surface unmapped fields rather than silently dropping\n",
    "            ucsd = f\"ETO:{field.title()}\"\n",
    "        by_domain[ucsd].append(c)\n",
    "\n",
    "    if not by_domain:\n",
    "        return None\n",
    "\n",
    "    backbone = {}\n",
    "    for ucsd, members in by_domain.items():\n",
    "        members_sorted = sorted(members,\n",
    "                                key=lambda c: -float(c.get(\"size\") or 0))\n",
    "        subs = []\n",
    "        seen = set()\n",
    "        terms = set()\n",
    "        for c in members_sorted[:N_SUBS_PER_DOMAIN]:\n",
    "            title = str(c.get(\"title\") or \"\").strip()\n",
    "            if title and title.lower() not in seen:\n",
    "                subs.append(title)\n",
    "                seen.add(title.lower())\n",
    "            text = (str(c.get(\"title\") or \"\") + \" \" +\n",
    "                    str(c.get(\"summary\") or \"\")).lower()\n",
    "            for w in re.findall(r\"[a-z][a-z\\-]+\", text):\n",
    "                if len(w) > 3 and w not in LABEL_STOP:\n",
    "                    terms.add(w)\n",
    "        backbone[ucsd] = {\n",
    "            \"subdisciplines\": subs or [\"General\"],\n",
    "            \"terms\":          sorted(terms)[:N_TERMS_PER_DOMAIN] or [ucsd.lower()],\n",
    "            \"_source\":        (f\"ETO Map of Science ({len(members)} clusters; \"\n",
    "                               f\"primary field='{members[0].get('field')}')\"),\n",
    "            \"_n_clusters\":    len(members),\n",
    "        }\n",
    "    return backbone\n",
    "\n",
    "\n",
    "layer_A = build_layer_A_backbone(eto_clusters) if eto_clusters else None\n",
    "if layer_A:\n",
    "    print(f\"✓ Layer A built: {len(layer_A)} domains from ETO\")\n",
    "    for d, node in sorted(layer_A.items(),\n",
    "                          key=lambda x: -x[1].get(\"_n_clusters\", 0)):\n",
    "        print(f\"   · {d:55s} ({node['_n_clusters']:>4} clusters, \"\n",
    "              f\"{len(node['subdisciplines']):>2} subs, \"\n",
    "              f\"{len(node['terms']):>3} terms)\")\n",
    "else:\n",
    "    print(\"(Layer A: empty — no clusters mapped)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2f4329c",
   "metadata": {},
   "source": [
    "REPLACED\n",
    "# Canonical column name candidates (lower-case, comparison ignores spaces/_/-)\n",
    "COL_CANDIDATES = {\n",
    "    \"cluster_id\":      [\"clusterid\", \"id\", \"cluster\"],\n",
    "    \"title\":           [\"clustertitle\", \"title\", \"name\"],\n",
    "    \"summary\":         [\"clustersummary\", \"summary\", \"description\"],\n",
    "    \"disciplines\":     [\"disciplines\", \"discipline\", \"researchdisciplines\"],\n",
    "    \"fields\":          [\"fields\", \"field\", \"researchfields\"],\n",
    "    \"subfields\":       [\"subfields\", \"subfield\"],\n",
    "    \"topics\":          [\"topics\", \"topic\"],\n",
    "    \"key_concepts\":    [\"keyconcepts\", \"concepts\", \"keywords\"],\n",
    "    \"map_x\":           [\"mapx\", \"x\", \"xcoord\"],\n",
    "    \"map_y\":           [\"mapy\", \"y\", \"ycoord\"],\n",
    "    \"size\":            [\"size\", \"narticles\", \"articlecount\"],\n",
    "}\n",
    "\n",
    "def _norm(s):\n",
    "    return re.sub(r\"[\\s_\\-]+\", \"\", str(s).strip().lower())\n",
    "\n",
    "\n",
    "def resolve_columns(eto_df):\n",
    "    '''Return a dict mapping canonical names → actual column names in eto_df.'''\n",
    "    avail = {_norm(c): c for c in eto_df.columns}\n",
    "    resolved = {}\n",
    "    for canon, candidates in COL_CANDIDATES.items():\n",
    "        for c in candidates:\n",
    "            if c in avail:\n",
    "                resolved[canon] = avail[c]\n",
    "                break\n",
    "    return resolved\n",
    "\n",
    "\n",
    "def _split_multivalue(cell):\n",
    "    '''ETO often stores multi-valued columns as 'A | B | C' or 'A; B; C' or JSON list.'''\n",
    "    if cell is None or (isinstance(cell, float) and np.isnan(cell)):\n",
    "        return []\n",
    "    s = str(cell).strip()\n",
    "    if not s or s.lower() in {\"nan\", \"none\", \"[]\"}:\n",
    "        return []\n",
    "    # Try JSON list\n",
    "    if s.startswith(\"[\") and s.endswith(\"]\"):\n",
    "        try:\n",
    "            parsed = json.loads(s)\n",
    "            if isinstance(parsed, list):\n",
    "                return [str(x).strip() for x in parsed if str(x).strip()]\n",
    "        except Exception:\n",
    "            pass\n",
    "    # Fall back to delimiter split\n",
    "    parts = re.split(r\"\\s*[\\|;]\\s*\", s)\n",
    "    if len(parts) == 1:\n",
    "        # Try comma when there's no pipe / semicolon\n",
    "        parts = re.split(r\"\\s*,\\s*\", s)\n",
    "    return [p.strip() for p in parts if p.strip()]\n",
    "\n",
    "\n",
    "eto_clusters = None\n",
    "if ETO_CSV_PATH and Path(ETO_CSV_PATH).exists():\n",
    "    try:\n",
    "        eto_raw = pd.read_csv(ETO_CSV_PATH)\n",
    "        cols = resolve_columns(eto_raw)\n",
    "        print(f\"  Loaded {len(eto_raw):,} rows from {ETO_CSV_PATH}\")\n",
    "        print(f\"  Resolved columns: {cols}\")\n",
    "        # Normalize: every row becomes a dict with canonical keys\n",
    "        recs = []\n",
    "        for _, row in eto_raw.iterrows():\n",
    "            rec = {\"raw_row_index\": _}\n",
    "            for canon, actual in cols.items():\n",
    "                val = row[actual]\n",
    "                if canon in (\"disciplines\",\"fields\",\"subfields\",\"topics\",\"key_concepts\"):\n",
    "                    rec[canon] = _split_multivalue(val)\n",
    "                else:\n",
    "                    rec[canon] = val if pd.notna(val) else None\n",
    "            recs.append(rec)\n",
    "        eto_clusters = recs\n",
    "        print(f\"  ✓ Normalized {len(eto_clusters):,} ETO clusters\")\n",
    "    except Exception as e:\n",
    "        print(f\"  ✗ Failed to load ETO CSV: {type(e).__name__}: {e}\")\n",
    "elif ETO_CSV_PATH:\n",
    "    print(f\"  ⚠ ETO_CSV_PATH set but file not found: {ETO_CSV_PATH}\")\n",
    "else:\n",
    "    print(\"  (ETO_CSV_PATH not set — skipping Layer A; using Layer B alone)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2631104",
   "metadata": {},
   "source": [
    "### 3.2 Build the Layer-A backbone from ETO clusters\n",
    "\n",
    "ETO discipline → subdiscipline mapping is built bottom-up: each unique\n",
    "`discipline` value becomes a domain, the `fields` it appears with become\n",
    "subdisciplines, and the union of `key_concepts` + `topics` becomes the\n",
    "keyword/term set. Coordinates are dropped (the main notebook computes its\n",
    "own spring layout from the dict structure).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5afc3040",
   "metadata": {},
   "source": [
    "REPLACED\n",
    "def build_layer_A_backbone(clusters):\n",
    "    '''Return a dict matching the main notebook's SCIENCE_BACKBONE shape:\n",
    "        {domain_name: {\"subdisciplines\": [...], \"terms\": [...]}}\n",
    "    '''\n",
    "    if not clusters:\n",
    "        return None\n",
    "\n",
    "    domain_subs = {}    # domain → set of subdisciplines (fields)\n",
    "    domain_terms = {}   # domain → set of terms (key_concepts + topics)\n",
    "\n",
    "    for c in clusters:\n",
    "        disciplines = c.get(\"disciplines\") or []\n",
    "        fields      = c.get(\"fields\") or []\n",
    "        topics      = c.get(\"topics\") or []\n",
    "        concepts    = c.get(\"key_concepts\") or []\n",
    "        title       = c.get(\"title\")\n",
    "\n",
    "        if not disciplines:\n",
    "            continue\n",
    "        # Each cluster usually has its top 3 disciplines & top 3 fields.\n",
    "        # We use the *primary* discipline as the parent.\n",
    "        primary_d = disciplines[0]\n",
    "        domain_subs.setdefault(primary_d, set())\n",
    "        domain_terms.setdefault(primary_d, set())\n",
    "\n",
    "        for f in fields:\n",
    "            domain_subs[primary_d].add(f)\n",
    "        for t in topics + concepts:\n",
    "            if 2 <= len(t) <= 60:\n",
    "                domain_terms[primary_d].add(t.lower())\n",
    "        # Cluster titles are often informative short labels — include them as terms\n",
    "        if title and 2 <= len(str(title)) <= 80:\n",
    "            domain_terms[primary_d].add(str(title).lower())\n",
    "\n",
    "    backbone = {}\n",
    "    for d, subs in domain_subs.items():\n",
    "        terms = sorted(domain_terms.get(d, set()))\n",
    "        backbone[d] = {\n",
    "            \"subdisciplines\": sorted(subs)[:12] or [\"General\"],\n",
    "            \"terms\":          terms[:50] or [d.lower()],\n",
    "            \"_source\":        \"ETO Map of Science cluster CSV export\",\n",
    "            \"_n_clusters\":    sum(1 for c in clusters\n",
    "                                  if (c.get(\"disciplines\") or [None])[0] == d),\n",
    "        }\n",
    "    return backbone\n",
    "\n",
    "\n",
    "layer_A = build_layer_A_backbone(eto_clusters) if eto_clusters else None\n",
    "if layer_A:\n",
    "    print(f\"✓ Layer A built: {len(layer_A)} domains from ETO\")\n",
    "    for d, node in layer_A.items():\n",
    "        print(f\"   · {d:40s} ({node['_n_clusters']} clusters, \"\n",
    "              f\"{len(node['subdisciplines'])} subdisciplines, \"\n",
    "              f\"{len(node['terms'])} terms)\")\n",
    "else:\n",
    "    print(\"(Layer A: empty — ETO CSV not loaded)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed7d4bc7",
   "metadata": {},
   "source": [
    "## 4. UCSD canonical 13-discipline scaffold (Layer B)\n",
    "\n",
    "The 2010 UCSD Map of Science classifies ~25,000 journals into **554\n",
    "subdisciplines** aggregated into **13 disciplines** [1] (Börner et al. 2012).\n",
    "Each subdiscipline has its own keyword set and x/y coordinates from the\n",
    "spherical layout. The full 554-subdiscipline classification ships with the\n",
    "paper's supplementary data (CC BY-NC-SA 3.0) and is also redistributed as a\n",
    "Pajek `.net` file by the Science Integrity Alliance replication repo.\n",
    "\n",
    "Here we hardcode:\n",
    "- the **13 canonical discipline names** with their published colors, and\n",
    "- a **representative set of subdisciplines per discipline** sufficient to\n",
    "  anchor cross-disciplinary work (especially for subsidence, geosciences,\n",
    "  remote sensing, water resources, infrastructure, policy).\n",
    "\n",
    "The seed terms per subdiscipline are deliberately compact — extend them as\n",
    "your corpus reveals new vocabulary.\n",
    "\n",
    "> **Citation requirement.** UCSD Map of Science is CC BY-NC-SA 3.0. When you\n",
    "> use this scaffold downstream, cite Börner et al. 2012 (PLOS ONE) and include\n",
    "> the standard attribution to The Regents of the University of California,\n",
    "> SciTech Strategies, Observatoire des Sciences et des Technologies, and the\n",
    "> Cyberinfrastructure for Network Science Center.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d3065142",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✓ Layer B built: 13 disciplines (UCSD canonical)\n",
      "   · Math & Physics                                      8 subs ·  15 terms · #7570B3\n",
      "   · Chemistry                                           6 subs ·  11 terms · #1F78B4\n",
      "   · Earth Sciences                                     11 subs ·  34 terms · #B15928\n",
      "   · Biology                                             7 subs ·  14 terms · #33A02C\n",
      "   · Biotechnology                                       6 subs ·  10 terms · #FB9A99\n",
      "   · Infectious Disease                                  5 subs ·  10 terms · #A6CEE3\n",
      "   · Medical Specialties                                 6 subs ·  10 terms · #E31A1C\n",
      "   · Health Professionals                                5 subs ·   8 terms · #FDBF6F\n",
      "   · Brain Research                                      4 subs ·   9 terms · #FF7F00\n",
      "   · Electrical Engineering & Computer Science           7 subs ·  15 terms · #CAB2D6\n",
      "   · Chemical, Mechanical, & Civil Engineering           8 subs ·  21 terms · #6A3D9A\n",
      "   · Social Sciences                                     7 subs ·  15 terms · #FFFF99\n",
      "   · Humanities                                          6 subs ·   8 terms · #FFD92F\n"
     ]
    }
   ],
   "source": [
    "# ── UCSD 13-discipline canonical scaffold ────────────────────────────\n",
    "# Discipline names and color palette per published topical visualizations\n",
    "# (CNS-IU, VIVO) of the 2010 UCSD map.\n",
    "#\n",
    "# Each discipline ships with a representative set of subdisciplines + seed\n",
    "# terms. The full 554-subdiscipline list lives in the published supplement;\n",
    "# extend below as your corpus calls for it.\n",
    "\n",
    "UCSD_SCAFFOLD = {\n",
    "    \"Math & Physics\": {\n",
    "        \"color\": \"#7570B3\",\n",
    "        \"subdisciplines\": [\"Mathematics\", \"Applied Mathematics\", \"Statistics\",\n",
    "                           \"Theoretical Physics\", \"Condensed Matter Physics\",\n",
    "                           \"Astrophysics\", \"Geodesy\", \"Geophysics\"],\n",
    "        \"terms\": [\"mathematics\",\"statistics\",\"probability\",\"physics\",\"quantum\",\n",
    "                  \"relativity\",\"astrophysics\",\"cosmology\",\"geodesy\",\"gravity\",\n",
    "                  \"geophysics\",\"seismic\",\"numerical\",\"tensor\",\"equation\"],\n",
    "    },\n",
    "    \"Chemistry\": {\n",
    "        \"color\": \"#1F78B4\",\n",
    "        \"subdisciplines\": [\"Analytical Chemistry\", \"Organic Chemistry\",\n",
    "                           \"Inorganic Chemistry\", \"Physical Chemistry\",\n",
    "                           \"Polymer Chemistry\", \"Geochemistry\"],\n",
    "        \"terms\": [\"chemistry\",\"chemical\",\"organic\",\"inorganic\",\"catalysis\",\n",
    "                  \"polymer\",\"molecule\",\"reaction\",\"spectroscopy\",\"geochemistry\",\n",
    "                  \"isotope\"],\n",
    "    },\n",
    "    \"Earth Sciences\": {\n",
    "        \"color\": \"#B15928\",\n",
    "        \"subdisciplines\": [\"Hydrology\", \"Hydrogeology\", \"Geomorphology\",\n",
    "                           \"Sedimentology\", \"Tectonics\", \"Climatology\",\n",
    "                           \"Oceanography\", \"Soil Science\", \"Remote Sensing\",\n",
    "                           \"Land Subsidence\", \"Coastal Science\"],\n",
    "        \"terms\": [\"earth\",\"geology\",\"geological\",\"hydrology\",\"groundwater\",\n",
    "                  \"aquifer\",\"sediment\",\"sedimentation\",\"tectonic\",\"fault\",\n",
    "                  \"subsidence\",\"compaction\",\"consolidation\",\"coastal\",\"delta\",\n",
    "                  \"shoreline\",\"sea level\",\"sea-level\",\"insar\",\"interferometric\",\n",
    "                  \"remote sensing\",\"sentinel\",\"gps\",\"gnss\",\"lidar\",\"leveling\",\n",
    "                  \"extensometer\",\"soil\",\"watershed\",\"climate\",\"precipitation\",\n",
    "                  \"drought\",\"oceanography\",\"seismic\"],\n",
    "    },\n",
    "    \"Biology\": {\n",
    "        \"color\": \"#33A02C\",\n",
    "        \"subdisciplines\": [\"Ecology\", \"Botany\", \"Zoology\",\n",
    "                           \"Evolutionary Biology\", \"Marine Biology\",\n",
    "                           \"Microbiology\", \"Plant Biology\"],\n",
    "        \"terms\": [\"biology\",\"ecology\",\"ecosystem\",\"habitat\",\"species\",\n",
    "                  \"biodiversity\",\"wetland\",\"marsh\",\"mangrove\",\"plant\",\n",
    "                  \"vegetation\",\"microbial\",\"organism\",\"population\"],\n",
    "    },\n",
    "    \"Biotechnology\": {\n",
    "        \"color\": \"#FB9A99\",\n",
    "        \"subdisciplines\": [\"Genetics\", \"Genomics\", \"Molecular Biology\",\n",
    "                           \"Biochemistry\", \"Bioengineering\", \"Synthetic Biology\"],\n",
    "        \"terms\": [\"biotechnology\",\"gene\",\"genetic\",\"genome\",\"protein\",\"enzyme\",\n",
    "                  \"molecular\",\"biochemistry\",\"bioengineering\",\"synthetic\"],\n",
    "    },\n",
    "    \"Infectious Disease\": {\n",
    "        \"color\": \"#A6CEE3\",\n",
    "        \"subdisciplines\": [\"Virology\", \"Immunology\", \"Parasitology\",\n",
    "                           \"Epidemiology\", \"Public Health Microbiology\"],\n",
    "        \"terms\": [\"infection\",\"virus\",\"viral\",\"immunology\",\"bacteria\",\n",
    "                  \"pathogen\",\"epidemic\",\"epidemiology\",\"outbreak\",\"vaccine\"],\n",
    "    },\n",
    "    \"Medical Specialties\": {\n",
    "        \"color\": \"#E31A1C\",\n",
    "        \"subdisciplines\": [\"Cardiology\", \"Oncology\", \"Neurology\", \"Surgery\",\n",
    "                           \"Radiology\", \"Internal Medicine\"],\n",
    "        \"terms\": [\"clinical\",\"patient\",\"medicine\",\"cardiology\",\"cancer\",\n",
    "                  \"oncology\",\"surgery\",\"radiology\",\"diagnosis\",\"treatment\"],\n",
    "    },\n",
    "    \"Health Professionals\": {\n",
    "        \"color\": \"#FDBF6F\",\n",
    "        \"subdisciplines\": [\"Public Health\", \"Nursing\", \"Health Services Research\",\n",
    "                           \"Health Policy\", \"Environmental Health\"],\n",
    "        \"terms\": [\"public health\",\"nursing\",\"health services\",\"health policy\",\n",
    "                  \"environmental health\",\"occupational\",\"wellbeing\",\"care\"],\n",
    "    },\n",
    "    \"Brain Research\": {\n",
    "        \"color\": \"#FF7F00\",\n",
    "        \"subdisciplines\": [\"Neuroscience\", \"Cognitive Science\", \"Psychiatry\",\n",
    "                           \"Behavioral Neuroscience\"],\n",
    "        \"terms\": [\"brain\",\"neural\",\"neuron\",\"cognition\",\"cognitive\",\"memory\",\n",
    "                  \"behavior\",\"psychiatry\",\"neuroscience\"],\n",
    "    },\n",
    "    \"Electrical Engineering & Computer Science\": {\n",
    "        \"color\": \"#CAB2D6\",\n",
    "        \"subdisciplines\": [\"Computer Science\", \"Machine Learning\",\n",
    "                           \"Artificial Intelligence\", \"Signal Processing\",\n",
    "                           \"Electrical Engineering\", \"Computer Vision\",\n",
    "                           \"Networks & Distributed Systems\"],\n",
    "        \"terms\": [\"algorithm\",\"computer\",\"computing\",\"machine learning\",\n",
    "                  \"deep learning\",\"neural network\",\"convolutional\",\"lstm\",\n",
    "                  \"signal processing\",\"electrical\",\"gis\",\"spatial analysis\",\n",
    "                  \"computer vision\",\"data science\",\"artificial intelligence\"],\n",
    "    },\n",
    "    \"Chemical, Mechanical, & Civil Engineering\": {\n",
    "        \"color\": \"#6A3D9A\",\n",
    "        \"subdisciplines\": [\"Civil Engineering\", \"Geotechnical Engineering\",\n",
    "                           \"Hydraulic Engineering\", \"Structural Engineering\",\n",
    "                           \"Mechanical Engineering\", \"Chemical Engineering\",\n",
    "                           \"Petroleum Engineering\", \"Mining Engineering\"],\n",
    "        \"terms\": [\"civil engineering\",\"geotechnical\",\"hydraulic\",\"structural\",\n",
    "                  \"foundation\",\"soil mechanics\",\"mechanical\",\"chemical engineering\",\n",
    "                  \"petroleum\",\"reservoir\",\"oil\",\"gas\",\"co2 storage\",\"sequestration\",\n",
    "                  \"mining\",\"coal\",\"longwall\",\"material\",\"stress\",\"strain\",\n",
    "                  \"finite element\"],\n",
    "    },\n",
    "    \"Social Sciences\": {\n",
    "        \"color\": \"#FFFF99\",\n",
    "        \"subdisciplines\": [\"Economics\", \"Political Science\", \"Sociology\",\n",
    "                           \"Geography\", \"Anthropology\", \"Public Policy\",\n",
    "                           \"Decision Science\"],\n",
    "        \"terms\": [\"economics\",\"economic\",\"political\",\"policy\",\"governance\",\n",
    "                  \"regulation\",\"stakeholder\",\"decision support\",\"management\",\n",
    "                  \"sociology\",\"geography\",\"anthropology\",\"community\",\n",
    "                  \"groundwater management\",\"gma\"],\n",
    "    },\n",
    "    \"Humanities\": {\n",
    "        \"color\": \"#FFD92F\",\n",
    "        \"subdisciplines\": [\"History\", \"Philosophy\", \"Literature\", \"Linguistics\",\n",
    "                           \"Ethics\", \"Science & Technology Studies\"],\n",
    "        \"terms\": [\"history\",\"philosophy\",\"ethics\",\"literature\",\"linguistics\",\n",
    "                  \"discourse\",\"narrative\",\"cultural\"],\n",
    "    },\n",
    "}\n",
    "\n",
    "# Compact UCSD reference frame as a backbone dict\n",
    "def materialize_UCSD(scaffold):\n",
    "    backbone = {}\n",
    "    for d, node in scaffold.items():\n",
    "        backbone[d] = {\n",
    "            \"subdisciplines\": list(node[\"subdisciplines\"]),\n",
    "            \"terms\":          list(node[\"terms\"]),\n",
    "            \"color\":          node.get(\"color\"),\n",
    "            \"_source\":        \"UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-SA 3.0)\",\n",
    "        }\n",
    "    return backbone\n",
    "\n",
    "layer_B = materialize_UCSD(UCSD_SCAFFOLD)\n",
    "print(f\"✓ Layer B built: {len(layer_B)} disciplines (UCSD canonical)\")\n",
    "for d, node in layer_B.items():\n",
    "    print(f\"   · {d:50s} {len(node['subdisciplines']):2d} subs · \"\n",
    "          f\"{len(node['terms']):3d} terms · {node['color']}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6146b664",
   "metadata": {},
   "source": [
    "### 4.1 Optional: load the full 554-subdiscipline UCSD classification\n",
    "\n",
    "The compact scaffold above carries only a curated handful of subdisciplines per discipline.\n",
    "If you want the **full grain** — every one of the 554 published subdisciplines\n",
    "with its x/y coordinates from the spherical layout — this cell fetches the\n",
    "Pajek `.net` file from the Science Integrity Alliance replication repository\n",
    "[5] (which redistributes the UCSD classification under CC BY-NC-SA 3.0) and\n",
    "replaces `layer_B` with the full-detail version.\n",
    "\n",
    "The parser groups subdisciplines under their parent discipline using the\n",
    "`ic` (interior color) attribute, which is the discipline tag in the Pajek\n",
    "file. Each subdiscipline's term list is seeded from its own label, minus\n",
    "generic stopwords (e.g. `\"Clinical Cancer Research\"` →\n",
    "`[\"clinical\", \"cancer\", \"clinical cancer research\"]`),\n",
    "so the full set is immediately usable for the downstream keyword-matcher in\n",
    "the main notebook. Extend the term lists further as your corpus surfaces new\n",
    "vocabulary.\n",
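    "\n",
    "A vertex line in the `.net` file looks schematically like this (an\n",
    "illustrative line matching the field order the parser's regex expects, not a\n",
    "verbatim excerpt of the file):\n",
    "\n",
    "```\n",
    "17 \"Clinical Cancer Research\" 0.6123 0.3341 x_fact 1.2 y_fact 1.2 ic Red bc Black\n",
    "```\n",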
    "\n",
    "The cell **degrades gracefully**: if the network fetch fails (firewall,\n",
    "offline, etc.) or the file isn't already cached locally, it leaves the\n",
    "compact scaffold from §4 in place and prints a clear \"skipped\" message.\n",
    "\n",
    "**[5]** Science Integrity Alliance, *science-map* (GitHub).\n",
    "https://github.com/Science-Integrity-Alliance/science-map · CC BY-NC-SA 3.0,\n",
    "redistributing the 2010 UCSD Map of Science classification (Börner et al. 2012).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "02bc393c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✓ Pajek source: /scratch/01813/sawp33/tapis/b673a5f6-e828-4ff9-add8-fd125f967b4a-007/work/dso_cookbook_fixes/ScienceBackboneResults/UCSDmap_with_disciplines.net.txt  (103.0 KB)\n",
      "✓ Parsed 567 vertices (expected ≈ 567 = 554 subdisciplines + 13 disciplines)\n",
      "  ⚠ 24 vertices had a color that didn't match any discipline anchor; they were dropped.\n",
      "✓ Layer B upgraded: 12 disciplines × 531 subdisciplines (531 assigned)\n",
      "   · Biology                                             43 subs ·   88 terms · OliveGreen\n",
      "   · Biotechnology                                       11 subs ·   33 terms · Emerald\n",
      "   · Medical Specialties                                 69 subs ·  128 terms · Red\n",
      "   · Chemical, Mechanical, & Civil Engineering           76 subs ·  186 terms · SkyBlue\n",
      "   · Chemistry                                           32 subs ·   73 terms · Blue\n",
      "   · Earth Sciences                                      22 subs ·   72 terms · Mahogany\n",
      "   · Electrical Engineering & Computer Science           57 subs ·  142 terms · Lavender\n",
      "   · Brain Research                                      28 subs ·   63 terms · Dandelion\n",
      "   · Humanities                                          26 subs ·   56 terms · Canary\n",
      "   · Math & Physics                                      28 subs ·   78 terms · Mulberry\n",
      "   · Health Professionals                                70 subs ·  152 terms · Peach\n",
      "   · Social Sciences                                     69 subs ·  156 terms · Yellow\n"
     ]
    }
   ],
   "source": [
    "# ── Switch this off to keep the compact scaffold from §4 ─────────────\n",
    "USE_UCSD_FULL = True\n",
    "\n",
    "# Pajek .net file: 567 vertices (554 subdisciplines + 13 disciplines)\n",
    "UCSD_NET_URL    = (\"https://raw.githubusercontent.com/Science-Integrity-Alliance/\"\n",
    "                   \"science-map/main/UCSDmap_with_disciplines.net.txt\")\n",
    "UCSD_NET_LOCAL  = OUTPUT_DIR / \"UCSDmap_with_disciplines.net.txt\"\n",
    "\n",
    "# Canonical discipline labels we expect to find as named nodes in the .net\n",
    "UCSD_DISCIPLINE_NAMES = {\n",
    "    \"Math & Physics\", \"Chemistry\", \"Earth Sciences\", \"Biology\",\n",
    "    \"Biotechnology\", \"Infectious Disease\", \"Medical Specialties\",\n",
    "    \"Health Professionals\", \"Brain Research\",\n",
    "    \"Electrical Engineering & Computer Science\",\n",
    "    \"Chemical, Mechanical, & Civil Engineering\",\n",
    "    \"Social Sciences\", \"Humanities\",\n",
    "    # Known minor variants seen in some versions of the file:\n",
    "    \"Math and Physics\", \"Medical specialties\", \"Health professionals\",\n",
    "}\n",
    "\n",
    "_VERTEX_RE = re.compile(\n",
    "    r'^(\\d+)\\s+\"([^\"]+)\"\\s+([\\d.eE+\\-]+)\\s+([\\d.eE+\\-]+)'\n",
    "    r'(?:\\s+x_fact\\s+([\\d.eE+\\-]+))?'\n",
    "    r'(?:\\s+y_fact\\s+([\\d.eE+\\-]+))?'\n",
    "    r'(?:\\s+ic\\s+(\\S+))?'\n",
    "    r'(?:\\s+bc\\s+(\\S+))?'\n",
    ")\n",
    "\n",
    "\n",
    "def fetch_ucsd_net(url=UCSD_NET_URL, dest=UCSD_NET_LOCAL, timeout=30):\n",
    "    '''Cache to dest; only re-download if dest is missing or trivially small.'''\n",
    "    # Real file is ~103 KB; 5 KB threshold catches truncated downloads while\n",
    "    # allowing legitimate hand-curated subsets (or this notebook's smoke tests).\n",
    "    if dest.exists() and dest.stat().st_size > 5_000:\n",
    "        return dest\n",
    "    import urllib.request\n",
    "    req = urllib.request.Request(url, headers={\"User-Agent\": \"subside-backbone-setup/1.0\"})\n",
    "    with urllib.request.urlopen(req, timeout=timeout) as resp:\n",
    "        data = resp.read()\n",
    "    dest.write_bytes(data)\n",
    "    return dest\n",
    "\n",
    "\n",
    "def parse_ucsd_net(path):\n",
    "    '''Parse the UCSD Pajek .net file. Returns list of vertex dicts.'''\n",
    "    vertices = []\n",
    "    section = None\n",
    "    with open(path, \"r\", encoding=\"utf-8\", errors=\"replace\") as fh:\n",
    "        for raw_line in fh:\n",
    "            line = raw_line.strip()\n",
    "            if not line:\n",
    "                continue\n",
    "            lower = line.lower()\n",
    "            if lower.startswith(\"*vertices\"):\n",
    "                section = \"vertices\"; continue\n",
    "            if lower.startswith(\"*edges\") or lower.startswith(\"*arcs\"):\n",
    "                section = \"edges\"; continue\n",
    "            if section != \"vertices\":\n",
    "                continue\n",
    "            m = _VERTEX_RE.match(line)\n",
    "            if not m:\n",
    "                continue\n",
    "            vertices.append({\n",
    "                \"id\":     int(m.group(1)),\n",
    "                \"label\":  m.group(2),\n",
    "                \"x\":      float(m.group(3)),\n",
    "                \"y\":      float(m.group(4)),\n",
    "                \"x_fact\": float(m.group(5)) if m.group(5) else None,\n",
    "                \"y_fact\": float(m.group(6)) if m.group(6) else None,\n",
    "                \"color\":  m.group(7) or \"\",\n",
    "            })\n",
    "    return vertices\n",
    "\n",
    "\n",
    "# Reusable stopwords for seeding subdiscipline term lists from their labels\n",
    "_LABEL_STOP = {\"the\",\"and\",\"of\",\"in\",\"on\",\"for\",\"to\",\"with\",\"by\",\"from\",\"an\",\"a\",\n",
    "               \"or\",\"general\",\"other\",\"sciences\",\"science\",\"research\",\"studies\"}\n",
    "\n",
    "\n",
    "def build_ucsd_full_backbone(vertices, scaffold_palette=UCSD_SCAFFOLD):\n",
    "    '''Bucket subdisciplines under their parent discipline via the ic color tag.\n",
    "\n",
    "    Strategy:\n",
    "      1. Find the 13 discipline-named vertices; record (color → discipline_name).\n",
    "      2. Group every remaining vertex by its color → its parent discipline.\n",
    "      3. Seed each subdiscipline's term list from its label tokens.\n",
    "      4. Inherit the publication-attested hex color from UCSD_SCAFFOLD when\n",
    "         the Pajek color string is a plain name (Red/Blue/Green/…).\n",
    "    '''\n",
    "    # Step 1: discover discipline → color\n",
    "    discipline_color = {}     # discipline_name → pajek_color\n",
    "    discipline_coords = {}    # discipline_name → (x, y)\n",
    "    for v in vertices:\n",
    "        if v[\"label\"] in UCSD_DISCIPLINE_NAMES:\n",
    "            # Normalize minor case/spelling variants to canonical names,\n",
    "            # treating \"and\" and \"&\" as equivalent (cf. the variant list above)\n",
    "            norm = lambda s: s.lower().replace(\"&\", \"and\").replace(\" \", \"\")\n",
    "            canon = v[\"label\"]\n",
    "            for d in scaffold_palette:\n",
    "                if norm(d) == norm(canon):\n",
    "                    canon = d\n",
    "                    break\n",
    "            discipline_color[canon] = v[\"color\"]\n",
    "            discipline_coords[canon] = (v[\"x\"], v[\"y\"])\n",
    "\n",
    "    if not discipline_color:\n",
    "        raise RuntimeError(\"No discipline-named anchor nodes found in .net file. \"\n",
    "                           \"The file format may have changed.\")\n",
    "\n",
    "    color_to_discipline = {c: d for d, c in discipline_color.items()}\n",
    "\n",
    "    # Step 2: initialize empty backbone with metadata inherited from §4 scaffold\n",
    "    backbone = {}\n",
    "    for d, pajek_color in discipline_color.items():\n",
    "        scaffold_node = scaffold_palette.get(d, {})\n",
    "        hex_color = scaffold_node.get(\"color\")    # publication hex\n",
    "        backbone[d] = {\n",
    "            \"subdisciplines\":      [],\n",
    "            \"terms\":               list(scaffold_node.get(\"terms\", [])),\n",
    "            \"color\":               hex_color or pajek_color,\n",
    "            \"pajek_color\":         pajek_color,\n",
    "            \"coords\":              {\"x\": discipline_coords[d][0],\n",
    "                                    \"y\": discipline_coords[d][1]},\n",
    "            \"subdiscipline_coords\": {},\n",
    "            \"_source\": (\"UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-SA 3.0), \"\n",
    "                        \"full 554-subdiscipline classification via Science Integrity Alliance\"),\n",
    "        }\n",
    "\n",
    "    # Step 3: bucket non-discipline vertices\n",
    "    n_assigned, n_orphan = 0, 0\n",
    "    for v in vertices:\n",
    "        if v[\"label\"] in UCSD_DISCIPLINE_NAMES:\n",
    "            continue\n",
    "        parent = color_to_discipline.get(v[\"color\"])\n",
    "        if not parent:\n",
    "            n_orphan += 1\n",
    "            continue\n",
    "        n_assigned += 1\n",
    "        sub = v[\"label\"]\n",
    "        backbone[parent][\"subdisciplines\"].append(sub)\n",
    "        backbone[parent][\"subdiscipline_coords\"][sub] = {\"x\": v[\"x\"], \"y\": v[\"y\"]}\n",
    "        # Seed terms from the subdiscipline label\n",
    "        words = re.findall(r\"[A-Za-z][A-Za-z\\-]+\", sub.lower())\n",
    "        terms = backbone[parent][\"terms\"]\n",
    "        existing = set(terms)\n",
    "        for w in words:\n",
    "            if len(w) > 2 and w not in _LABEL_STOP and w not in existing:\n",
    "                terms.append(w); existing.add(w)\n",
    "        sub_l = sub.lower()\n",
    "        if sub_l not in existing:\n",
    "            terms.append(sub_l)\n",
    "\n",
    "    # Sort subdisciplines alphabetically within each discipline for readability\n",
    "    for d in backbone:\n",
    "        backbone[d][\"subdisciplines\"].sort(key=str.lower)\n",
    "\n",
    "    return backbone, n_assigned, n_orphan\n",
    "\n",
    "\n",
    "# Try to use the full classification; fall back to the compact scaffold on failure\n",
    "if USE_UCSD_FULL:\n",
    "    try:\n",
    "        net_path = fetch_ucsd_net()\n",
    "        size_kb = net_path.stat().st_size / 1024\n",
    "        print(f\"✓ Pajek source: {net_path}  ({size_kb:.1f} KB)\")\n",
    "        vertices = parse_ucsd_net(net_path)\n",
    "        print(f\"✓ Parsed {len(vertices)} vertices \"\n",
    "              f\"(expected ≈ 567 = 554 subdisciplines + 13 disciplines)\")\n",
    "        layer_B_full, n_assigned, n_orphan = build_ucsd_full_backbone(vertices)\n",
    "        if n_orphan:\n",
    "            print(f\"  ⚠ {n_orphan} vertices had a color that didn't match any discipline anchor; \"\n",
    "                  f\"they were dropped.\")\n",
    "        # Replace layer B with the full-grain version\n",
    "        layer_B = layer_B_full\n",
    "        n_total_subs = sum(len(node[\"subdisciplines\"]) for node in layer_B.values())\n",
    "        print(f\"✓ Layer B upgraded: {len(layer_B)} disciplines × \"\n",
    "              f\"{n_total_subs} subdisciplines ({n_assigned} assigned)\")\n",
    "        for d, node in layer_B.items():\n",
    "            print(f\"   · {d:50s} {len(node['subdisciplines']):3d} subs · \"\n",
    "                  f\"{len(node['terms']):4d} terms · {node['pajek_color']}\")\n",
    "    except Exception as e:\n",
    "        print(f\"  ⚠ Could not load the full UCSD classification: {type(e).__name__}: {e}\")\n",
    "        print(\"  Falling back to the compact 13-discipline scaffold from §4.\")\n",
    "        print(\"  To retry: download\")\n",
    "        print(f\"    {UCSD_NET_URL}\")\n",
    "        print(f\"  manually and save it to {UCSD_NET_LOCAL}, then re-run this cell.\")\n",
    "else:\n",
    "    print(\"USE_UCSD_FULL is False — keeping the compact 13-discipline scaffold from §4.\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39c395ba",
   "metadata": {},
   "source": [
    "## 5. Combine the layers\n",
    "\n",
    "Strategy:\n",
    "\n",
    "- **Both Layer A and Layer B present** → produce *both* in the JSON file under\n",
    "  separate keys (`\"eto\"` and `\"ucsd\"`), plus a `\"merged\"` view that overlays\n",
    "  ETO domains/fields on top of the UCSD scaffold's keyword sets. The main\n",
    "  notebook can pick whichever it wants via the `backbone_layer` setting it\n",
    "  exposes after Tier 0.\n",
    "- **Only Layer B present** → produce just `\"ucsd\"` and tag `\"merged\"` to point\n",
    "  at it. This is the most common case until you do the ETO export.\n",
    "- **Only Layer A present** → mirror it into `\"merged\"`.\n",
    "\n",
    "The merge rule is conservative: where a Layer A discipline name matches a UCSD\n",
    "discipline (case-insensitive exact match, falling back to a shared substantive\n",
    "token), we *extend* the UCSD subdiscipline and term sets with Layer A's. ETO\n",
    "domains that don't match any UCSD discipline are added as new top-level\n",
    "entries so nothing is dropped.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "650768e0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✓ Merged: 12 domains\n",
      "   Layer A (ETO)  : 0 domains\n",
      "   Layer B (UCSD) : 12 disciplines\n"
     ]
    }
   ],
   "source": [
    "def merge_layers(layer_A, layer_B):\n",
    "    '''Return a merged dict. Layer B is the scaffold, Layer A extends it.'''\n",
    "    if not layer_B and not layer_A:\n",
    "        return None\n",
    "    if not layer_A:\n",
    "        return {d: dict(node, _source=\"UCSD canonical only\") for d, node in layer_B.items()}\n",
    "    if not layer_B:\n",
    "        return {d: dict(node, _source=\"ETO only\") for d, node in layer_A.items()}\n",
    "\n",
    "    merged = {d: {\"subdisciplines\": list(node[\"subdisciplines\"]),\n",
    "                  \"terms\":          list(node[\"terms\"]),\n",
    "                  \"color\":          node.get(\"color\"),\n",
    "                  \"_source\":        \"UCSD canonical (extended by ETO)\"}\n",
    "              for d, node in layer_B.items()}\n",
    "\n",
    "    # Match ETO domains to UCSD: exact lowercase match, else any shared token (>3 chars)\n",
    "    ucsd_lower = {d.lower(): d for d in merged}\n",
    "    for eto_d, eto_node in layer_A.items():\n",
    "        match = None\n",
    "        eto_lower = eto_d.lower()\n",
    "        # Try exact\n",
    "        if eto_lower in ucsd_lower:\n",
    "            match = ucsd_lower[eto_lower]\n",
    "        else:\n",
    "            # Try token overlap (>= 1 substantive token shared)\n",
    "            eto_tokens = {t for t in re.split(r\"\\W+\", eto_lower) if len(t) > 3}\n",
    "            for u_lower, u_d in ucsd_lower.items():\n",
    "                u_tokens = {t for t in re.split(r\"\\W+\", u_lower) if len(t) > 3}\n",
    "                if eto_tokens & u_tokens:\n",
    "                    match = u_d\n",
    "                    break\n",
    "        if match:\n",
    "            # Extend UCSD discipline\n",
    "            existing_subs  = set(merged[match][\"subdisciplines\"])\n",
    "            existing_terms = set(merged[match][\"terms\"])\n",
    "            for s in eto_node[\"subdisciplines\"]:\n",
    "                if s not in existing_subs:\n",
    "                    merged[match][\"subdisciplines\"].append(s)\n",
    "                    existing_subs.add(s)\n",
    "            for t in eto_node[\"terms\"]:\n",
    "                t_l = t.lower()\n",
    "                if t_l not in existing_terms:\n",
    "                    merged[match][\"terms\"].append(t_l)\n",
    "                    existing_terms.add(t_l)\n",
    "            merged[match][\"_source\"] = \"UCSD canonical + ETO match: \" + eto_d\n",
    "        else:\n",
    "            # Add ETO domain as a new entry\n",
    "            merged[eto_d] = dict(eto_node)\n",
    "            merged[eto_d][\"_source\"] = \"ETO only (no UCSD match)\"\n",
    "\n",
    "    return merged\n",
    "\n",
    "\n",
    "backbone_eto    = layer_A\n",
    "backbone_ucsd   = layer_B\n",
    "backbone_merged = merge_layers(layer_A, layer_B)\n",
    "\n",
    "print(f\"✓ Merged: {len(backbone_merged)} domains\")\n",
    "print(f\"   Layer A (ETO)  : {len(layer_A) if layer_A else 0} domains\")\n",
    "print(f\"   Layer B (UCSD) : {len(layer_B)} disciplines\")\n"
   ]
  },
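  {
   "cell_type": "markdown",
   "id": "3f9a1b7c",
   "metadata": {},
   "source": [
    "The matching rule above can be illustrated on a toy pair. This is a minimal\n",
    "sketch, not part of the pipeline; `_tokens` is a hypothetical helper that\n",
    "restates the token split used inside `merge_layers`:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def _tokens(name):\n",
    "    # Same rule as merge_layers: keep tokens longer than 3 characters\n",
    "    return {t for t in re.split(r'\\W+', name.lower()) if len(t) > 3}\n",
    "\n",
    "# A hypothetical ETO domain name vs. the UCSD discipline it should extend\n",
    "shared = _tokens('Computer science') & _tokens('Electrical Engineering & Computer Science')\n",
    "print(sorted(shared))  # ['computer', 'science'] -> overlap, so the domains merge\n",
    "```\n",
    "\n",
    "An empty intersection means the ETO domain is appended as a new `ETO only`\n",
    "entry instead.\n"
   ]
  },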
  {
   "cell_type": "markdown",
   "id": "573ffbdd",
   "metadata": {},
   "source": [
    "## 6. Persist the backbone to `science_backbone.json`\n",
    "\n",
    "The main notebook's Tier 0 hook reads this file. The schema is:\n",
    "\n",
    "```json\n",
    "{\n",
    "  \"schema_version\": \"1.0\",\n",
    "  \"produced_at\":    \"<ISO timestamp>\",\n",
    "  \"produced_by\":    \"SUBSIDE_ScienceBackbone_Setup.ipynb\",\n",
    "  \"selected_layer\": \"merged\" | \"ucsd\" | \"eto\",\n",
    "  \"citations\":      [...],\n",
    "  \"layers\": {\n",
    "    \"eto\":    { ... } | null,\n",
    "    \"ucsd\":   { ... },\n",
    "    \"merged\": { ... }\n",
    "  }\n",
    "}\n",
    "```\n",
    "\n",
    "`selected_layer` is the one the main notebook will load by default; you can\n",
    "change it later by editing the JSON in place.\n"
   ]
  },
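  {
   "cell_type": "markdown",
   "id": "8e2d4c61",
   "metadata": {},
   "source": [
    "A consumer of this schema might look like the following sketch. The real\n",
    "Tier 0 hook lives in the main notebook; `load_backbone` and its argument are\n",
    "illustrative, not the hook's actual API:\n",
    "\n",
    "```python\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "def load_backbone(path):\n",
    "    # Read the file written by section 6 and return the default layer\n",
    "    payload = json.loads(Path(path).read_text())\n",
    "    assert payload['schema_version'] == '1.0'\n",
    "    layer_name = payload['selected_layer']   # 'merged', 'ucsd', or 'eto'\n",
    "    return layer_name, payload['layers'][layer_name]\n",
    "```\n",
    "\n",
    "Because the choice lives in the file, switching layers is a one-word edit of\n",
    "`selected_layer` — no re-run of this notebook required.\n"
   ]
  },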
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "595079be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✓ Wrote /scratch/01813/sawp33/tapis/b673a5f6-e828-4ff9-add8-fd125f967b4a-007/work/dso_cookbook_fixes/ScienceBackboneResults/science_backbone.json\n",
      "  Selected layer: ucsd\n",
      "  File size     : 215,581 bytes\n"
     ]
    }
   ],
   "source": [
    "# Default selection: prefer 'merged' when both layers exist, else 'ucsd'\n",
    "SELECTED_LAYER = (\"merged\" if (layer_A and layer_B) else\n",
    "                  \"ucsd\"   if layer_B            else\n",
    "                  \"eto\"    if layer_A            else None)\n",
    "\n",
    "CITATIONS = [\n",
    "    {\n",
    "        \"label\": \"UCSD Map of Science\",\n",
    "        \"citation\": (\"Börner K, Klavans R, Patek M, Zoss AM, Biberstine JR, \"\n",
    "                     \"Light RP, Larivière V, Boyack KW (2012). Design and \"\n",
    "                     \"Update of a Classification System: The UCSD Map of \"\n",
    "                     \"Science. PLOS ONE 7(7): e39464.\"),\n",
    "        \"doi\": \"10.1371/journal.pone.0039464\",\n",
    "        \"license\": \"CC BY-NC-SA 3.0\",\n",
    "        \"attribution_text\": (\"The authors wish to acknowledge The Regents of \"\n",
    "                             \"the University of California, SciTech Strategies, \"\n",
    "                             \"Observatoire des Sciences et des Technologies, \"\n",
    "                             \"and the Cyberinfrastructure for Network Science \"\n",
    "                             \"Center for making the 2010 UCSD Map of Science \"\n",
    "                             \"and Classification System available for this work.\"),\n",
    "    },\n",
    "    {\n",
    "        \"label\": \"Consensus Map of Science\",\n",
    "        \"citation\": (\"Klavans R, Boyack KW (2009). Toward a Consensus Map of \"\n",
    "                     \"Science. JASIST 60(3): 455–476.\"),\n",
    "        \"doi\": \"10.1002/asi.20991\",\n",
    "    },\n",
    "    {\n",
    "        \"label\": \"Mapping Knowledge Domains\",\n",
    "        \"citation\": (\"Shiffrin RM, Börner K (2004). Mapping Knowledge Domains. \"\n",
    "                     \"PNAS 101 (Suppl. 1): 5183–5185.\"),\n",
    "        \"doi\": \"10.1073/pnas.0307852100\",\n",
    "    },\n",
    "    {\n",
    "        \"label\": \"ETO Map of Science\",\n",
    "        \"citation\": (\"Emerging Technology Observatory, Center for Security \"\n",
    "                     \"and Emerging Technology, Georgetown University. \"\n",
    "                     \"ETO Map of Science.\"),\n",
    "        \"url\": \"https://sciencemap.eto.tech/\",\n",
    "        \"methodology_url\": \"https://eto.tech/dataset-docs/mac-clusters/\",\n",
    "    },\n",
    "]\n",
    "\n",
    "payload = {\n",
    "    \"schema_version\": \"1.0\",\n",
    "    \"produced_at\":    datetime.utcnow().isoformat() + \"Z\",\n",
    "    \"produced_by\":    \"SUBSIDE_ScienceBackbone_Setup.ipynb\",\n",
    "    \"bib_source\":     str(BIB_PATH),\n",
    "    \"eto_csv_source\": str(ETO_CSV_PATH) if ETO_CSV_PATH else None,\n",
    "    \"selected_layer\": SELECTED_LAYER,\n",
    "    \"citations\":      CITATIONS,\n",
    "    \"layers\": {\n",
    "        \"eto\":    backbone_eto,\n",
    "        \"ucsd\":   backbone_ucsd,\n",
    "        \"merged\": backbone_merged,\n",
    "    },\n",
    "    \"search_seeds_for_eto_query\":   seeds if 'seeds' in globals() else [],\n",
    "}\n",
    "\n",
    "with open(BACKBONE_JSON_PATH, \"w\") as fh:\n",
    "    json.dump(payload, fh, indent=2)\n",
    "\n",
    "print(f\"✓ Wrote {BACKBONE_JSON_PATH}\")\n",
    "print(f\"  Selected layer: {SELECTED_LAYER}\")\n",
    "print(f\"  File size     : {BACKBONE_JSON_PATH.stat().st_size:,} bytes\")\n"
   ]
  },
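  {
   "cell_type": "markdown",
   "id": "5b7f0a93",
   "metadata": {},
   "source": [
    "As an optional sanity check before handing off, the written file can be\n",
    "re-read and its top-level structure verified. `check_backbone` below is a\n",
    "hypothetical helper sketching the minimum the schema promises:\n",
    "\n",
    "```python\n",
    "def check_backbone(payload):\n",
    "    # Minimal structural checks against the section 6 schema\n",
    "    required = {'schema_version', 'produced_at', 'selected_layer',\n",
    "                'citations', 'layers'}\n",
    "    missing = required - payload.keys()\n",
    "    assert not missing, f'missing keys: {missing}'\n",
    "    # The default layer must actually exist among the stored layers\n",
    "    assert payload['selected_layer'] in payload['layers']\n",
    "    return True\n",
    "```\n",
    "\n",
    "For example, `check_backbone(json.loads(BACKBONE_JSON_PATH.read_text()))`\n",
    "should pass silently once the cell above has run.\n"
   ]
  },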
  {
   "cell_type": "markdown",
   "id": "c27644c8",
   "metadata": {},
   "source": [
    "## 7. Diagnostic view — verify before handing off\n",
    "\n",
    "Quick summary of what the main notebook will see. Use this to spot\n",
    "mis-mapped ETO domains, sparse UCSD term sets, or duplicated subdisciplines\n",
    "before running downstream analyses.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "27f7508e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "── Layer B — UCSD canonical ──\n",
      "  Biology                                                 subs=43  terms= 88  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Biotechnology                                           subs=11  terms= 33  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Medical Specialties                                     subs=69  terms=128  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Chemical, Mechanical, & Civil Engineering               subs=76  terms=186  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Chemistry                                               subs=32  terms= 73  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Earth Sciences                                          subs=22  terms= 72  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Electrical Engineering & Computer Science               subs=57  terms=142  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Brain Research                                          subs=28  terms= 63  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Humanities                                              subs=26  terms= 56  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Math & Physics                                          subs=28  terms= 78  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Health Professionals                                    subs=70  terms=152  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "  Social Sciences                                         subs=69  terms=156  [UCSD Map of Science 2010 (Börner et al. 2012, CC BY-NC-]\n",
      "\n",
      "── Layer A — ETO (may be empty) ──\n",
      "  (empty)\n",
      "\n",
      "── MERGED (ucsd will be used by default) ──\n",
      "  Biology                                                 subs=43  terms= 88  [UCSD canonical only]\n",
      "  Biotechnology                                           subs=11  terms= 33  [UCSD canonical only]\n",
      "  Medical Specialties                                     subs=69  terms=128  [UCSD canonical only]\n",
      "  Chemical, Mechanical, & Civil Engineering               subs=76  terms=186  [UCSD canonical only]\n",
      "  Chemistry                                               subs=32  terms= 73  [UCSD canonical only]\n",
      "  Earth Sciences                                          subs=22  terms= 72  [UCSD canonical only]\n",
      "  Electrical Engineering & Computer Science               subs=57  terms=142  [UCSD canonical only]\n",
      "  Brain Research                                          subs=28  terms= 63  [UCSD canonical only]\n",
      "  Humanities                                              subs=26  terms= 56  [UCSD canonical only]\n",
      "  Math & Physics                                          subs=28  terms= 78  [UCSD canonical only]\n",
      "  Health Professionals                                    subs=70  terms=152  [UCSD canonical only]\n",
      "  Social Sciences                                         subs=69  terms=156  [UCSD canonical only]\n"
     ]
    }
   ],
   "source": [
    "def summarize_backbone(backbone, label):\n",
    "    print(f\"\\n── {label} ──\")\n",
    "    if not backbone:\n",
    "        print(\"  (empty)\")\n",
    "        return\n",
    "    for d, node in backbone.items():\n",
    "        n_subs  = len(node.get(\"subdisciplines\", []))\n",
    "        n_terms = len(node.get(\"terms\", []))\n",
    "        src     = node.get(\"_source\", \"\")\n",
    "        print(f\"  {d:55s} subs={n_subs:2d}  terms={n_terms:3d}  [{src[:55]}]\")\n",
    "\n",
    "summarize_backbone(backbone_ucsd,   \"Layer B — UCSD canonical\")\n",
    "summarize_backbone(backbone_eto,    \"Layer A — ETO (may be empty)\")\n",
    "summarize_backbone(backbone_merged, f\"MERGED ({SELECTED_LAYER} will be used by default)\")\n"
   ]
  },
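  {
   "cell_type": "markdown",
   "id": "c4d81e27",
   "metadata": {},
   "source": [
    "The duplicated-subdiscipline check mentioned above can also be automated.\n",
    "This is a small sketch over the same backbone dict shape;\n",
    "`duplicated_subdisciplines` is a hypothetical helper, not part of the main\n",
    "notebook:\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def duplicated_subdisciplines(backbone):\n",
    "    # Flag subdisciplines that appear under more than one discipline\n",
    "    counts = Counter(s for node in backbone.values()\n",
    "                     for s in node.get('subdisciplines', []))\n",
    "    return sorted(s for s, n in counts.items() if n > 1)\n",
    "```\n",
    "\n",
    "An empty result from `duplicated_subdisciplines(backbone_merged)` means no\n",
    "subdiscipline is claimed by two domains.\n"
   ]
  },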
  {
   "cell_type": "markdown",
   "id": "194d6884",
   "metadata": {},
   "source": [
    "## 8. Wiring this into `SUBSIDE_Science_Backbone.ipynb`\n",
    "\n",
    "The main notebook already has a **Tier 0** hook (added in v1.1) that loads the\n",
    "JSON this notebook just wrote. To use it:\n",
    "\n",
    "1. **Open** `SUBSIDE_Science_Backbone.ipynb`.\n",
    "2. In §1.1, set:\n",
    "\n",
    "   ```python\n",
    "   SCIENCE_BACKBONE_JSON = \"<absolute path to science_backbone.json>\"\n",
    "   ```\n",
    "\n",
    "   If you've followed the default paths, that's\n",
    "   `ScienceBackboneResults/science_backbone.json` relative to this notebook.\n",
    "3. **Run** the main notebook normally. §5 then prints a `Tier 0: …` source\n",
    "   line that names the selected layer.\n",
    "\n",
    "If `SCIENCE_BACKBONE_JSON` is left `None` (the default), the main notebook\n",
    "behaves exactly as it did before — Tier 1 (ETO via sbp) → Tier 2 (sbp default)\n",
    "→ Tier 3 (SUBSIDE inline).\n",
    "\n",
    "### Refreshing the backbone\n",
    "\n",
    "Re-run **this** notebook only when:\n",
    "\n",
    "- You've done a fresh ETO Map of Science CSV export (point `ETO_CSV_PATH` at it).\n",
    "- Your corpus has shifted enough that the §2.1 search seeds change appreciably.\n",
    "- You want to extend the UCSD `_SCAFFOLD` dict with new subdisciplines.\n",
    "\n",
    "The main notebook can be re-run as often as you like without touching this one.\n",
    "\n",
    "### Citing the result\n",
    "\n",
    "When you publish anything derived from this backbone, include:\n",
    "\n",
    "- **Börner et al. 2012** (UCSD Map of Science) — required attribution per the\n",
    "  CC BY-NC-SA 3.0 license. The full text is in the JSON under\n",
    "  `citations[0].attribution_text`.\n",
    "- **ETO Map of Science** — if Layer A was used. The required citation form is\n",
    "  the label `\"ETO Map of Science\"` plus a link to `https://sciencemap.eto.tech/`.\n",
    "- **Klavans & Boyack 2009** — when discussing the consensus-map motivation.\n",
    "- **Shiffrin & Börner 2004** — when framing the broader knowledge-domain\n",
    "  mapping context.\n",
    "\n",
    "All four are pre-formatted in `payload[\"citations\"]` for easy export to your\n",
    "manuscript's reference manager.\n",
    "\n",
    "---"
   ]
  },
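  {
   "cell_type": "markdown",
   "id": "7a6b2f50",
   "metadata": {},
   "source": [
    "Changing `selected_layer` by hand is easy to get wrong in a large JSON file;\n",
    "a scripted version of that in-place edit might look like this sketch\n",
    "(`set_selected_layer` is illustrative, not part of either notebook):\n",
    "\n",
    "```python\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "def set_selected_layer(path, layer):\n",
    "    # Rewrite only the selected_layer field; everything else is preserved\n",
    "    p = Path(path)\n",
    "    payload = json.loads(p.read_text())\n",
    "    if payload['layers'].get(layer) is None:\n",
    "        raise ValueError(f'layer {layer!r} is empty or absent in {path}')\n",
    "    payload['selected_layer'] = layer\n",
    "    p.write_text(json.dumps(payload, indent=2))\n",
    "```\n",
    "\n",
    "For example, `set_selected_layer(BACKBONE_JSON_PATH, 'ucsd')` would switch\n",
    "the main notebook's default without rebuilding any layer.\n"
   ]
  },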
  {
   "cell_type": "markdown",
   "id": "809a4ecd",
   "metadata": {},
   "source": [
    "## What this notebook does, at a glance\n",
    "\n",
    "| Section | Role |\n",
    "|---|---|\n",
    "| 1 | Setup, paths, dependency check |\n",
    "| 2 | Quick BibTeX summary → ETO query seed terms |\n",
    "| 3 | ETO Map of Science walkthrough + CSV ingest → Layer A backbone |\n",
    "| 4 | UCSD canonical 13-discipline scaffold (Börner et al. 2012) → Layer B |\n",
    "| 5 | Merge Layer A + B (UCSD scaffold extended by ETO matches) |\n",
    "| 6 | Persist `science_backbone.json` with citations + provenance |\n",
    "| 7 | Diagnostic summary of all three layers |\n",
    "| 8 | Wiring instructions for the main notebook |\n",
    "\n",
    "This setup notebook is the *referenced and reusable* anchor for cross-disciplinary\n",
    "analyses across SUBSIDE — and for any other corpus you'd like to project onto a\n",
    "science basemap.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1de595b1-e798-453d-8800-e162b601754c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
