Chapter 11: ConceptMap-Text — Model Building and Clustering
11.1 Step 4: Feature Settings
This step adjusts each dimension’s weight (influence), letting you emphasize specific analysis axes or dampen noisy ones.
Operations
Sliders are displayed for each dimension:
| Parameter | Range | Default | Description |
|---|---|---|---|
| Dimension Weight | 0.0-2.0 | 1.0 | Each dimension’s influence on model building |
- Slider right (>1.0): Emphasize this dimension. Differences along this axis are amplified
- Slider left (<1.0): De-emphasize this dimension. Differences along this axis are reduced
- 0.0: Completely ignore this dimension
Steps:
- Adjust sliders as needed (the defaults, all 1.0, are usually sufficient)
- Click “Save Weights” to confirm the settings
- Click “Reset All” at any time to restore all dimensions to 1.0
Usage Examples:
- Want to emphasize “Theory ↔ Practice” dimension → Set its weight to 1.5
- A dimension seems noisy and meaningless → Set its weight to 0.3
- Want all dimensions equally weighted → Keep defaults
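Conceptually, a dimension weight scales how much that axis contributes to distances during model building. The product’s internals are not documented here, but a common way to apply such weights is to scale each coordinate before distances are computed, as in this illustrative sketch (function and variable names are hypothetical):

```python
import numpy as np

def apply_weights(X, weights):
    """Scale each column of X by its weight (0.0 disables a dimension)."""
    return X * np.asarray(weights)

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
# Emphasize dimension 0 (weight 1.5), completely ignore dimension 1 (weight 0.0)
Xw = apply_weights(X, [1.5, 0.0])
# Dimension 1 now contributes nothing to any distance computation
```

A weight above 1.0 stretches the axis, so differences along it dominate distances; a weight below 1.0 compresses it.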
11.2 Step 5: Model Building
Learn the concept network using the GNG (Growing Neural Gas) algorithm. GNG is a neural network algorithm that adaptively places nodes (concept representative points) according to data distribution.
Parameter List
Basic Parameters:
| Parameter | Range | Default | Description |
|---|---|---|---|
| Max Nodes | 10-500 | Data count × 0.6 | Maximum nodes GNG places. More creates finer models but may add noise |
| Max Iterations | 100-50,000 | 4,000 | Number of learning iterations. More ensures convergence but increases processing time |
| Lambda | 1-200 | 20 | Interval for inserting new nodes. Smaller → more frequent insertion → more nodes |
| Max Age | 5-200 | 50 | Threshold for deleting unused edges. Smaller → more aggressive pruning → sparser network |
Algorithm Selection:
| Algorithm | Description | Recommended Use |
|---|---|---|
| Default (GNG) | Standard GNG. Hard assignment (each data assigned to nearest single node) | Normal use (recommended) |
| Fuzzy | Fuzzy membership. Each data probabilistically belongs to multiple nodes | Data with ambiguous boundaries |
| Enhanced Fuzzy | Extended version with repulsion and merge features | Advanced analysis |
Fuzzy Additional Parameters:
| Parameter | Range | Default | Description |
|---|---|---|---|
| Temperature | 0.1-2.0 | 0.5 | Fuzziness. Higher → membership more evenly distributed |
Enhanced Fuzzy Additional Parameters:
| Parameter | Description |
|---|---|
| Temperature End | Temperature value at the end of learning |
| Fuzzifier | Fuzzy membership function parameter |
| Repulsion Beta | Repulsion force strength between nodes |
| Merge Epsilon | Merge threshold for nodes too close together |
| Inertia Alpha | Node movement inertia |
Steps
- Set parameters (defaults are usually sufficient)
- Click “Build Model”
- A progress bar is displayed during building
- After completion, a 2D preview of the built network is shown
2D Preview Guide:
- Each circle is a GNG node. Larger nodes have more assigned data
- Lines between nodes are MST (Minimum Spanning Tree) edges
- Dropdown selects X-axis and Y-axis dimensions to view the network from different angles
Parameter Tuning Guidelines:
- Too many nodes, model too fine-grained → Reduce Max Nodes
- Too few nodes, model too coarse → Increase Max Nodes
- Rule of thumb: Set Max Nodes to 50-70% of data row count
- 50 rows of data → Max Nodes 25-35
- 200 rows of data → Max Nodes 100-140
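The 50-70% rule of thumb as a small helper (illustrative only; note the product’s own default is data count × 0.6, which sits inside this range):

```python
def max_nodes_range(n_rows, lo=0.5, hi=0.7):
    """Suggested Max Nodes range: 50-70% of the data row count."""
    return round(n_rows * lo), round(n_rows * hi)

# max_nodes_range(50)  -> (25, 35)
# max_nodes_range(200) -> (100, 140)
```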
Credit Cost: 20 credits for model building (10 credits with your own API key)
11.3 Step 6: Clustering
Classify the built GNG nodes into thematic groups (clusters).
Clustering Methods
| Method | Description | Features |
|---|---|---|
| Ward | Hierarchical agglomerative. Minimizes within-cluster variance | Considers MST structure. Most stable. Recommended |
| K-Means | Centroid-based clustering | Fast. Suited for spherical clusters |
| HDBSCAN | Density-based. Auto-detects number of clusters | Suited for irregular cluster shapes |
| Hierarchical | General hierarchical clustering | Suited for dendrogram analysis |
| DBSCAN | Density-based. Detects noise | Suited for data with many outliers |
Settings
| Setting | Description |
|---|---|
| Number of Clusters | Specify via dropdown. Auto-recommended value also shown |
| Strict Connectivity | When checked, only MST-connected nodes can belong to the same cluster (Ward only) |
| Min Cluster Size | Minimum nodes per cluster (HDBSCAN / DBSCAN) |
| EPS | Density threshold (DBSCAN) |
Cluster Count Guidelines:
- 30 nodes or fewer → 3-5 clusters
- 30-100 nodes → 5-8 clusters
- 100+ nodes → 7-12 clusters
Steps
- Select a clustering method
- Specify number of clusters (or use auto-recommended value)
- Click “Run Clustering”
- Resulting clusters are color-coded in the preview
Cluster Labeling
Auto-Labeling:
- Click “Auto Label”
- AI analyzes data assigned to each cluster’s nodes and suggests theme names
- Suggested labels auto-fill each cluster’s input field
Manual Labeling:
- Enter custom names directly in each cluster’s input field
Label Examples:
- “Technological Innovation and Implementation Challenges”
- “User-Centered Design”
- “Organizational Culture and Change Management”
- “Market Trends and Business Models”
11.4 Troubleshooting
| Issue | Cause and Solution |
|---|---|
| Too many nodes, hard to read | Reduce Max Nodes in Step 5 and rebuild |
| Cluster assignments don’t match intuition | Change the number of clusters and re-run. Try methods other than Ward (K-Means, HDBSCAN) |
| Auto-labels are off-target | AI suggestions are reference only. Manually correct based on your understanding of the data |
| “Model Build” takes too long | Max Iterations may be too high. 4000 is sufficient in most cases |
| Going back and re-running clears later steps | By design. Re-running from dimension reduction resets feature settings and everything after. Save important results beforehand |
11.5 Model Building Tips — Understanding and Optimizing the Pipeline
ConceptMap-Text model building follows this pipeline. Understanding each step’s role and impact helps build better models.
Pipeline Overview
Text Data
↓ OpenAI Embedding (1536-3072 dimensions)
↓ UMAP Dimension Reduction (3-8 dimensions)
↓ GNG Learning (places nodes in reduced-dimension space)
↓ MST Connection (connects nodes via minimum spanning tree)
↓ Clustering (classifies nodes into groups)
The key point is that GNG-MST learns in the multi-dimensional space (3-8 dimensions) after UMAP dimension reduction. The 3D graph in the Explorer selects 3 dimensions for display, but the model itself is built with more dimensions.
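This key point can be made concrete with a runnable stand-in pipeline. To keep the sketch dependency-light, PCA stands in for UMAP and KMeans centroids stand in for GNG nodes; the real product uses UMAP + GNG, and the point being shown is only that the model lives in the full reduced space while the 3D view merely selects axes:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(120, 64))                  # stand-in for text embeddings
reduced = PCA(n_components=5).fit_transform(embeddings)  # "UMAP" step: 5-D space
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(reduced)
nodes = km.cluster_centers_                              # "GNG nodes": still 5-D
view_3d = nodes[:, [0, 1, 2]]                            # Explorer picks 3 of the 5 dims
```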
UMAP’s Role and Impact
UMAP is not just preprocessing — it determines the very structure of the space GNG learns in. Changing UMAP parameters:
- Changes distance relationships between concepts → Changes GNG node placement
- Changes cluster separation → Changes clustering results
This means the same data can produce different models depending on UMAP settings.
Practical UMAP Parameter Guidelines:
| Goal | n_neighbors | min_dist | Effect |
|---|---|---|---|
| Default | 15 | 0.1 | Good for most cases |
| Clearly separate clusters | 5-10 | 0.01-0.05 | Emphasizes local structure. Small concept groups separate easily |
| Uniform distribution | 20-30 | 0.3-0.5 | GNG nodes spread evenly across the space |
| Emphasize global structure | 30-50 | 0.3-0.5 | Capture overall trends. See the big picture of concepts |
Why uniform distribution matters: With default settings, semantically different concepts may be pulled extremely far apart. This causes GNG nodes to cluster in dense areas while leaving distant regions uncovered. Increasing min_dist or n_neighbors makes the space more uniform, allowing GNG to cover the entire dataset more evenly.
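The guideline table above can be kept at hand as parameter presets. The parameter names are umap-learn’s; the values come from the table, with midpoints chosen where the table gives a range:

```python
UMAP_PRESETS = {
    "default":  {"n_neighbors": 15, "min_dist": 0.1},
    "separate": {"n_neighbors": 8,  "min_dist": 0.03},  # clear cluster separation
    "uniform":  {"n_neighbors": 25, "min_dist": 0.4},   # even node coverage
    "global":   {"n_neighbors": 40, "min_dist": 0.4},   # big-picture structure
}
# Usage (requires the umap-learn package):
#   import umap
#   reducer = umap.UMAP(n_components=5, **UMAP_PRESETS["uniform"])
#   reduced = reducer.fit_transform(embeddings)
```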
Node Count Considerations
GNG’s Max Nodes setting determines the model’s “granularity.”
Fewer nodes (about 1/3 of data count):
- Easier to grasp the “big picture” of knowledge
- Similar concepts merge into one node, creating a “summary-like” model
- Easier to see relationships between clusters
- Suitable for publishing as Mindware (coarser concept granularity leads to more natural conversations in chat)
More nodes (close to or exceeding data count):
- Can distinguish fine conceptual differences
- Useful for research where nuance matters
- However, differences between nodes become small, making interpretation harder
Difference from SOM (Self-Organizing Maps): in SOM, even with more nodes than data points, nodes spread evenly across the map and empty nodes form smooth gradients. In GNG-MST, nodes gather where data exists. This is GNG’s intended behavior, directly reflecting the structure inherent in the data. The MST visualization may therefore show nodes bunching locally, but each node’s values remain distinct.
Criteria for a Good Model
There is no theoretical formula for the “optimal number of nodes.” Instead, use these practical criteria:
If clusters in the Explorer view have intuitively understandable meanings and the differences between clusters can be clearly explained, the model is appropriate.
Specific checkpoints:
- Cluster interpretability: Can you assign natural theme names to each cluster?
- Inter-cluster distinction: Do adjacent clusters have different themes?
- Node representativeness: Is the data assigned to each node consistent with its position (dimension values)?
- Overall coverage: Are important themes from the original data reflected in the model?
Iterative Improvement Process
The optimal model may not be achieved on the first try. The following iterative process is recommended:
- Build with default settings first — Grasp the overall picture
- Cluster and explore — Check model granularity and structure
- Adjust as needed:
- Clusters too large → Increase node count or decrease UMAP n_neighbors
- Too many nodes, hard to interpret → Decrease node count
- Certain concept groups don’t separate → Decrease UMAP min_dist
- “Save” to preserve results — Save and compare results with different settings
Always save important results before changing parameters. Going back and re-running resets subsequent steps.