Chapter 11: ConceptMap-Text — Model Building and Clustering — ThinkNavi User Manual

11.1 Step 4: Feature Settings

This step adjusts the weight (influence) of each dimension, letting you emphasize specific analysis axes or dampen noisy ones.

Operations

Sliders are displayed for each dimension:

| Parameter | Range | Default | Description |
|---|---|---|---|
| Dimension Weight | 0.0-2.0 | 1.0 | Each dimension's influence on model building |
  • Slider right (>1.0): Emphasize this dimension. Differences along this axis are amplified
  • Slider left (<1.0): De-emphasize this dimension. Differences along this axis are reduced
  • 0.0: Completely ignore this dimension

Steps:

  1. Adjust sliders as needed
  2. Defaults (all 1.0) are usually sufficient
  3. “Reset All” button restores all dimensions to 1.0
  4. “Save Weights” button confirms settings

Usage Examples:

  • Want to emphasize “Theory ↔ Practice” dimension → Set its weight to 1.5
  • A dimension seems noisy and meaningless → Set its weight to 0.3
  • Want all dimensions equally weighted → Keep defaults
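One plausible way to picture what a dimension weight does (a sketch, not ThinkNavi's internal implementation): multiply each coordinate of the reduced-dimension data by its weight before model building, so distances along that axis are amplified or suppressed. The coordinates and weights below are hypothetical.

```python
import numpy as np

# Hypothetical reduced-dimension coordinates: 4 items x 3 dimensions
coords = np.array([
    [0.2,  1.0, -0.5],
    [0.8,  0.9,  0.3],
    [-0.4, 0.1,  0.7],
    [0.5, -0.6, -0.2],
])

# Weights per dimension: emphasize dim 0 (1.5), dampen dim 1 (0.3), ignore dim 2 (0.0)
weights = np.array([1.5, 0.3, 0.0])

# Scaling an axis rescales distances along it: a weight > 1 amplifies
# differences on that dimension, and 0.0 removes the dimension entirely.
weighted = coords * weights
print(weighted[:, 2])  # every value on the ignored dimension becomes 0.0
```

This is why a weight of 0.0 makes a dimension invisible to model building: all items collapse onto the same value along that axis.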

11.2 Step 5: Model Building

Learn the concept network using the GNG (Growing Neural Gas) algorithm. GNG is a neural network algorithm that adaptively places nodes (concept representative points) according to data distribution.

Parameter List

Basic Parameters:

| Parameter | Range | Default | Description |
|---|---|---|---|
| Max Nodes | 10-500 | Data count × 0.6 | Maximum number of nodes GNG places. More nodes create a finer model but may add noise |
| Max Iterations | 100-50,000 | 4,000 | Number of learning iterations. More iterations ensure convergence but increase processing time |
| Lambda | 1-200 | 20 | Interval for inserting new nodes. Smaller → more frequent insertion → more nodes |
| Max Age | 5-200 | 50 | Threshold for deleting unused edges. Smaller → more aggressive pruning → sparser network |

Algorithm Selection:

| Algorithm | Description | Recommended Use |
|---|---|---|
| Default (GNG) | Standard GNG. Hard assignment (each data point is assigned to its nearest single node) | Normal use (recommended) |
| Fuzzy | Fuzzy membership. Each data point probabilistically belongs to multiple nodes | Data with ambiguous boundaries |
| Enhanced Fuzzy | Extended version with repulsion and merge features | Advanced analysis |

Fuzzy Additional Parameters:

| Parameter | Range | Default | Description |
|---|---|---|---|
| Temperature | 0.1-2.0 | 0.5 | Fuzziness. Higher → membership more evenly distributed |
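To see what a temperature parameter typically does, here is a common formulation of fuzzy membership: a softmax over negative node distances, divided by the temperature. This is an illustration of the general technique, not necessarily ThinkNavi's exact formula; the distances are hypothetical.

```python
import numpy as np

def fuzzy_membership(dists, temperature):
    """Softmax over negative distances: one standard way to turn
    node distances into probabilistic (fuzzy) memberships."""
    logits = -np.asarray(dists, dtype=float) / temperature
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

dists = [0.2, 0.5, 1.0]             # distances from one item to three nodes
low  = fuzzy_membership(dists, 0.1) # low temperature: mass concentrates on the nearest node
high = fuzzy_membership(dists, 2.0) # high temperature: mass spreads more evenly
print(low.round(3), high.round(3))
```

Raising the temperature flattens the distribution, which matches the table above: higher → membership more evenly distributed.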

Enhanced Fuzzy Additional Parameters:

| Parameter | Description |
|---|---|
| Temperature End | Temperature value at the end of learning |
| Fuzzifier | Fuzzy membership function parameter |
| Repulsion Beta | Repulsion force strength between nodes |
| Merge Epsilon | Merge threshold for nodes too close together |
| Inertia Alpha | Node movement inertia |

Steps

  1. Set parameters (defaults are usually sufficient)
  2. Click “Build Model”
  3. A progress bar is displayed during building
  4. After completion, a 2D preview of the built network is shown

2D Preview Guide:

  • Each circle is a GNG node. Larger nodes have more assigned data
  • Lines between nodes are MST (Minimum Spanning Tree) edges
  • Dropdown selects X-axis and Y-axis dimensions to view the network from different angles
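The MST edges shown in the preview can be illustrated with SciPy: given node positions, the minimum spanning tree is the sparsest set of edges that keeps every node connected with minimum total length. The node coordinates below are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Hypothetical GNG node positions in the reduced-dimension space
nodes = np.array([[0.0, 0.0], [1.0, 0.1], [0.1, 1.0], [2.0, 2.0]])

# Pairwise Euclidean distances, then the MST over the complete graph
dist = squareform(pdist(nodes))
mst = minimum_spanning_tree(dist)

edges = np.argwhere(mst.toarray() > 0)
print(len(edges))  # an MST over N nodes always has N - 1 edges → 3
```

This is why the preview never shows cycles: an MST over N nodes has exactly N − 1 edges.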

Parameter Tuning Guidelines:

  • Too many nodes (model too fine) → Reduce Max Nodes
  • Too few nodes (model too coarse) → Increase Max Nodes
  • Rule of thumb: Set Max Nodes to 50-70% of data row count
  • 50 rows of data → Max Nodes 25-35
  • 200 rows of data → Max Nodes 100-140
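The rule of thumb above can be expressed as a small helper (a hypothetical function, not part of ThinkNavi; the 0.6 fraction matches the UI default of data count × 0.6, clamped to the 10-500 slider range):

```python
def recommended_max_nodes(n_rows: int, fraction: float = 0.6) -> int:
    """Manual's rule of thumb: Max Nodes ≈ 50-70% of the data row count,
    clamped to the parameter's 10-500 range."""
    return max(10, min(500, round(n_rows * fraction)))

print(recommended_max_nodes(50))   # 30  — within the suggested 25-35
print(recommended_max_nodes(200))  # 120 — within the suggested 100-140
```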

Credit Cost: 20 credits for model building (10 credits with your own API key)

11.3 Step 6: Clustering

Classify the built GNG nodes into thematic groups (clusters).

Clustering Methods

| Method | Description | Features |
|---|---|---|
| Ward | Hierarchical agglomerative. Minimizes within-cluster variance | Considers MST structure. Most stable. Recommended |
| K-Means | Centroid-based clustering | Fast. Suited for spherical clusters |
| HDBSCAN | Density-based. Auto-detects number of clusters | Suited for irregular cluster shapes |
| Hierarchical | General hierarchical clustering | Suited for dendrogram analysis |
| DBSCAN | Density-based. Detects noise | Suited for data with many outliers |
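For intuition, the three main method families behave like their scikit-learn equivalents shown below (an illustration of the general algorithms, not ThinkNavi's internals; the two-blob node positions are synthetic):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN

rng = np.random.default_rng(0)
# Hypothetical GNG node positions: two well-separated blobs in 2-D
nodes = np.vstack([rng.normal(0, 0.3, (15, 2)),
                   rng.normal(3, 0.3, (15, 2))])

# Ward: hierarchical agglomerative, minimizes within-cluster variance.
# (AgglomerativeClustering also accepts a `connectivity` sparse matrix,
# which is how an MST-based constraint like "Strict Connectivity"
# could be enforced.)
ward = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(nodes)

# K-Means: centroid-based, fast, favors spherical clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(nodes)

# DBSCAN: density-based, labels sparse points as noise (-1)
db = DBSCAN(eps=1.0, min_samples=3).fit_predict(nodes)

print(len(set(ward)), len(set(km)))  # both recover the 2 blobs
```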

Settings

| Setting | Description |
|---|---|
| Number of Clusters | Specify via dropdown. An auto-recommended value is also shown |
| Strict Connectivity | When checked, only MST-connected nodes can belong to the same cluster (Ward only) |
| Min Cluster Size | Minimum nodes per cluster (HDBSCAN / DBSCAN) |
| EPS | Density threshold (DBSCAN) |

Cluster Count Guidelines:

  • 30 nodes or fewer → 3-5 clusters
  • 30-100 nodes → 5-8 clusters
  • 100+ nodes → 7-12 clusters
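The guideline above can be written as a small helper (a hypothetical function for illustration, not part of the product):

```python
def suggested_cluster_range(n_nodes: int) -> tuple[int, int]:
    """Cluster-count guideline from the manual, as (min, max)."""
    if n_nodes <= 30:
        return (3, 5)
    if n_nodes <= 100:
        return (5, 8)
    return (7, 12)

print(suggested_cluster_range(25))   # (3, 5)
print(suggested_cluster_range(60))   # (5, 8)
print(suggested_cluster_range(150))  # (7, 12)
```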

Steps

  1. Select a clustering method
  2. Specify number of clusters (or use auto-recommended value)
  3. Click “Run Clustering”
  4. Resulting clusters are color-coded in the preview

Cluster Labeling

Auto-Labeling:

  1. Click “Auto Label”
  2. AI analyzes data assigned to each cluster’s nodes and suggests theme names
  3. Suggested labels auto-fill each cluster’s input field

Manual Labeling:

  • Enter custom names directly in each cluster’s input field

Label Examples:

  • “Technological Innovation and Implementation Challenges”
  • “User-Centered Design”
  • “Organizational Culture and Change Management”
  • “Market Trends and Business Models”

11.4 Troubleshooting

| Issue | Cause and Solution |
|---|---|
| Too many nodes, hard to read | Reduce Max Nodes in Step 5 and rebuild |
| Cluster assignments don't match intuition | Change the number of clusters and re-run. Try methods other than Ward (K-Means, HDBSCAN) |
| Auto-labels are off-target | AI suggestions are reference only. Correct them manually based on your understanding of the data |
| "Model Build" takes too long | Max Iterations may be too high. 4,000 is sufficient in most cases |
| Going back and re-running clears later steps | By design. Re-running from dimension reduction resets feature settings and everything after. Save important results beforehand |

11.5 Model Building Tips — Understanding and Optimizing the Pipeline

ConceptMap-Text model building follows this pipeline. Understanding each step’s role and impact helps build better models.

Pipeline Overview

Text Data
  ↓ OpenAI Embedding (1536-3072 dimensions)
  ↓ UMAP Dimension Reduction (3-8 dimensions)
  ↓ GNG Learning (places nodes in reduced-dimension space)
  ↓ MST Connection (connects nodes via minimum spanning tree)
  ↓ Clustering (classifies nodes into groups)

The key point is that GNG-MST learns in the multi-dimensional space (3-8 dimensions) after UMAP dimension reduction. The 3D graph in the Explorer selects 3 dimensions for display, but the model itself is built with more dimensions.
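The pipeline above can be sketched end to end with standard libraries. Everything here is a stand-in chosen for runnability, not ThinkNavi's implementation: random vectors replace OpenAI embeddings, PCA replaces UMAP (umap-learn is not assumed installed), and k-means centroids crudely stand in for GNG node placement; the MST and Ward steps use SciPy and scikit-learn directly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(42)

# 1. Embedding: stand-in random vectors (real pipeline: OpenAI, 1536-3072 dims)
embeddings = rng.normal(size=(60, 256))

# 2. Dimension reduction to 3-8 dims (real pipeline: UMAP; PCA as stand-in)
reduced = PCA(n_components=5, random_state=0).fit_transform(embeddings)

# 3. Node placement (real pipeline: GNG; k-means centroids as a crude stand-in,
#    using the ~60% rule of thumb for node count)
n_nodes = int(len(reduced) * 0.6)
nodes = KMeans(n_clusters=n_nodes, n_init=10,
               random_state=0).fit(reduced).cluster_centers_

# 4. MST connection between the placed nodes
mst = minimum_spanning_tree(squareform(pdist(nodes)))

# 5. Cluster the nodes (not the raw data) into thematic groups
labels = AgglomerativeClustering(n_clusters=6, linkage="ward").fit_predict(nodes)

print(nodes.shape, int((mst.toarray() > 0).sum()), len(set(labels)))
```

Note that steps 3-5 operate in the full reduced space (5 dimensions here), mirroring the point above: the 3D view is only a projection of a higher-dimensional model.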

UMAP’s Role and Impact

UMAP is not just preprocessing — it determines the very structure of the space GNG learns in. Changing UMAP parameters:

  • Changes distance relationships between concepts → Changes GNG node placement
  • Changes cluster separation → Changes clustering results

This means the same data can produce different models depending on UMAP settings.

Practical UMAP Parameter Guidelines:

| Goal | n_neighbors | min_dist | Effect |
|---|---|---|---|
| Default | 15 | 0.1 | Good for most cases |
| Clearly separate clusters | 5-10 | 0.01-0.05 | Emphasizes local structure. Small concept groups separate easily |
| Uniform distribution | 20-30 | 0.3-0.5 | GNG nodes spread evenly across the space |
| Emphasize global structure | 30-50 | 0.3-0.5 | Captures overall trends. Shows the big picture of concepts |

Why uniform distribution matters: With default settings, semantically different concepts may be pulled extremely far apart. This causes GNG nodes to cluster in dense areas while leaving distant regions uncovered. Increasing min_dist or n_neighbors makes the space more uniform, allowing GNG to cover the entire dataset more evenly.
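The table's guidelines can be kept as a small set of presets (hypothetical names and mid-range values chosen for illustration; the values are what you would pass as UMAP's `n_neighbors` and `min_dist` parameters):

```python
# Presets derived from the guideline table; values sit inside the
# recommended ranges (e.g. "separate_clusters" uses 8 from 5-10).
UMAP_PRESETS = {
    "default":           {"n_neighbors": 15, "min_dist": 0.1},
    "separate_clusters": {"n_neighbors": 8,  "min_dist": 0.03},
    "uniform_spread":    {"n_neighbors": 25, "min_dist": 0.4},
    "global_structure":  {"n_neighbors": 40, "min_dist": 0.4},
}

print(UMAP_PRESETS["default"])
```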

Node Count Considerations

GNG’s Max Nodes setting determines the model’s “granularity.”

Fewer nodes (about 1/3 of data count):

  • Easier to grasp the “big picture” of knowledge
  • Similar concepts merge into one node, creating a “summary-like” model
  • Easier to see relationships between clusters
  • Suitable for publishing as Mindware (coarser concept granularity leads to more natural conversations in chat)

More nodes (close to or exceeding data count):

  • Can distinguish fine conceptual differences
  • Useful for research where nuance matters
  • However, differences between nodes become small, making interpretation harder

Difference from SOM (Self-Organizing Maps): In SOM, even with more nodes than data points, nodes spread evenly across the map, and empty nodes form smooth gradients. In GNG-MST, nodes gather where the data exists. This is GNG's correct behavior: it directly reflects the structure inherent in the data. While the MST visualization may show nodes clustering locally, each node's values remain distinct.

Criteria for a Good Model

There is no theoretical formula for the “optimal number of nodes.” Instead, use these practical criteria:

If clusters in the Explorer view have intuitively understandable meanings and the differences between clusters can be clearly explained, the model is appropriate.

Specific checkpoints:

  1. Cluster interpretability: Can you assign natural theme names to each cluster?
  2. Inter-cluster distinction: Do adjacent clusters have different themes?
  3. Node representativeness: Is the data assigned to each node consistent with its position (dimension values)?
  4. Overall coverage: Are important themes from the original data reflected in the model?

Iterative Improvement Process

The optimal model may not be achieved on the first try. The following iterative process is recommended:

  1. Build with default settings first — Grasp the overall picture
  2. Cluster and explore — Check model granularity and structure
  3. Adjust as needed:
     • Clusters too large → Increase node count or decrease UMAP n_neighbors
     • Too many nodes, hard to interpret → Decrease node count
     • Certain concept groups don't separate → Decrease UMAP min_dist
  4. “Save” to preserve results — Save and compare results with different settings
  7. “Save” to preserve results — Save and compare results with different settings

Always save important results before changing parameters. Going back and re-running resets subsequent steps.