Chapter 10: ConceptMap-Text — Data Input and Preprocessing

Chapter 10: ConceptMap-Text — Data Input and Preprocessing — ThinkNavi User Manual

ConceptMap-Text is a 7-step wizard for building concept structure models (GNG-MST) from text data. Access it by clicking “Model & Explore” in the sidebar.

10.1 Screen Layout

The ConceptMap-Text screen has the following layout:

Left Sidebar (Step Navigation):

7 step buttons arranged vertically
Current step is highlighted
Incomplete steps are grayed out and non-clickable
Completed steps show a green checkmark
“Save/Load” button at the bottom

Main Area:

Control panel corresponding to the selected step

Step Order:

1. Data Input → 2. Embedding → 3. Dimension Reduction → 4. Feature Settings → 5. Model Build → 6. Clustering → 7. Exploration

Each step requires the previous step to be completed. However, you can go back to a completed step and re-run it (subsequent steps will be reset).
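
The gating rule above can be pictured as a small state machine. The following is a minimal sketch only: the WizardState class and its method names are hypothetical and are not part of ThinkNavi. It shows that a step opens once every earlier step is complete, and that re-running a completed step resets the steps after it.

```python
STEPS = [
    "Data Input", "Embedding", "Dimension Reduction", "Feature Settings",
    "Model Build", "Clustering", "Exploration",
]


class WizardState:
    """Hypothetical model of the step sidebar (not ThinkNavi's actual code)."""

    def __init__(self):
        # Number of leading steps that are currently complete.
        self.completed = 0

    def can_open(self, step: int) -> bool:
        # A 1-based step is clickable when all earlier steps are complete.
        return 1 <= step <= min(self.completed + 1, len(STEPS))

    def run(self, step: int) -> None:
        if not self.can_open(step):
            raise ValueError(f"Step {step} is locked")
        # Running or re-running a step marks it complete and resets later steps.
        self.completed = step


state = WizardState()
for step in (1, 2, 3):
    state.run(step)
state.run(2)  # re-run Embedding: Dimension Reduction is reset
print(state.completed)    # 2
print(state.can_open(3))  # True
print(state.can_open(4))  # False
```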

10.2 Step 1: Data Input

Specify data sources and text columns.

Project and File Selection

Select a project from the “Project” dropdown (created in AutoResearch)
Select a CSV file from the “File” dropdown
A preview of the first few rows and a list of the columns are displayed

Column Role Assignment

Assign a role to each CSV column by clicking the corresponding role button:

Role	Button Color	Description	Purpose
Text	Blue	Used as text data for analysis	Target for embedding → model building
Numeric	Green	Numeric data	Metadata for profile analysis
Categorical	Purple	Categorical data	Filtering for cluster analysis
Reference	Yellow	Display only as reference	Shown in node details during exploration
Unused	Gray	Not used in analysis	Exclude unnecessary columns

Important: At least one column must be set to “Text”. If no Text column is selected, you cannot proceed to the next step.

Recommended Settings Example (for concept extraction CSV):

concept_name_en → Text (main modeling target)
summary_en → Text (include summaries in modeling)
trigger_context_en → Text or Reference
abstract_structure_en → Text or Reference
chapter_section → Categorical
chunk_index → Numeric or Unused
book, author → Reference
key_terms → Reference or Unused
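
As an illustration, the recommended assignment can be written down as a simple mapping and checked against the rule that at least one column must be Text. This is a sketch under assumptions: the validate_roles function is hypothetical, and where the manual offers two choices (for example "Text or Reference") one of them is picked arbitrarily.

```python
VALID_ROLES = {"Text", "Numeric", "Categorical", "Reference", "Unused"}


def validate_roles(roles: dict) -> list:
    """Return the Text columns, or raise if the assignment is unusable."""
    unknown = {col: role for col, role in roles.items() if role not in VALID_ROLES}
    if unknown:
        raise ValueError(f"Unknown roles: {unknown}")
    text_columns = [col for col, role in roles.items() if role == "Text"]
    if not text_columns:
        # Mirrors the "No text column selected" error described in this chapter.
        raise ValueError("No text column selected")
    return text_columns


# One possible assignment for the concept extraction CSV.
roles = {
    "concept_name_en": "Text",
    "summary_en": "Text",
    "trigger_context_en": "Reference",
    "abstract_structure_en": "Reference",
    "chapter_section": "Categorical",
    "chunk_index": "Unused",
    "book": "Reference",
    "author": "Reference",
    "key_terms": "Unused",
}

print(validate_roles(roles))  # ['concept_name_en', 'summary_en']
```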

CCM (Connected Concept Model) Building:

Assigning the Text role to multiple columns builds an independent model for each column. The models are then cross-referenced by row index to form a CCM. For example, setting both concept_name_en and summary_en to Text builds a concept name model and a summary model that are linked to each other.
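
One way to picture the row-index cross-reference is shown below. This sketch rests on assumptions the manual does not state: that each model assigns every row to a node, and that a link between two nodes is counted whenever they share a row. The node IDs and the cross_reference function are hypothetical.

```python
from collections import Counter

# Hypothetical node assignments: row index -> node ID in each model.
name_model = {0: "N1", 1: "N1", 2: "N2", 3: "N3"}
summary_model = {0: "S1", 1: "S2", 2: "S2", 3: "S2"}


def cross_reference(model_a: dict, model_b: dict) -> Counter:
    """Count how many rows connect each pair of nodes across two models."""
    links = Counter()
    # Only rows present in both models can be cross-referenced.
    for row in model_a.keys() & model_b.keys():
        links[(model_a[row], model_b[row])] += 1
    return links


links = cross_reference(name_model, summary_model)
for (node_a, node_b), rows in sorted(links.items()):
    print(node_a, "<->", node_b, "shared rows:", rows)
```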

Loading Data

Once all column roles are set, click “Load Data”
CSV data is sent to the engine
On success, a preview table of the first 20 rows is displayed
This step is marked “Complete” and the next step (Embedding) unlocks
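
The preview can be approximated outside the tool with Python's standard csv module. This is a rough sketch, not the engine's loader: the preview function, the sample rows, and the decision to hide Unused columns in the preview are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def preview(stream, roles: dict, limit: int = 20) -> list:
    """Return up to `limit` rows, keeping only columns that have a role."""
    reader = csv.DictReader(stream)
    # Columns without an assigned role are treated as Unused (assumption).
    keep = [c for c in (reader.fieldnames or []) if roles.get(c, "Unused") != "Unused"]
    return [{c: row[c] for c in keep} for row in islice(reader, limit)]


# Toy data standing in for a concept extraction CSV.
sample = io.StringIO(
    "concept_name_en,summary_en,chunk_index\n"
    "Flow,Deep focus state,0\n"
    "Grit,Sustained effort,1\n"
)
sample_roles = {"concept_name_en": "Text", "summary_en": "Text", "chunk_index": "Unused"}

rows = preview(sample, sample_roles)
print(len(rows))             # 2
print(list(rows[0].keys()))  # ['concept_name_en', 'summary_en']
```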

Note: Re-running “Load Data” resets the engine session and all subsequent steps (embedding, dimension reduction, model, etc.).

10.3 Step 2: Embedding

Converts text data into vectors in a high-dimensional space. Semantically similar texts map to similar vectors, which forms the foundation for all subsequent analysis.
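
"Similar vectors" is usually measured with cosine similarity, which compares the direction of two vectors. The sketch below uses made-up 3-dimensional vectors in place of real embeddings (which have 1536 or 3072 dimensions). The values are invented purely to show the calculation.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: the first two are meant to represent related texts.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```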

Settings

Model Selection:

Model	Dimensions	Features	Recommended Use
text-embedding-3-small	1536	Fast, low cost. Sufficient accuracy	Normal use (recommended)
text-embedding-3-large	3072	High accuracy. More processing time/cost	When high accuracy is needed

Dimension Selection:

Option	Description
Auto	Uses the model’s default dimensions (recommended)
256	Low dimensional. Fast but may lose accuracy
512	Balanced
1024	Medium-high accuracy
1536	text-embedding-3-small default
3072	text-embedding-3-large default

Recommendation: text-embedding-3-small + Auto is typically sufficient.
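
For readers who generate embeddings outside the tool, the model and dimension settings translate roughly into request parameters as sketched below. The embedding_request helper is hypothetical, and how ThinkNavi's engine actually calls the API is not documented here. The sketch assumes that Auto means "do not send a dimensions value" and that a requested size may not exceed the model's default.

```python
MODEL_DEFAULT_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def embedding_request(texts, model="text-embedding-3-small", dimensions="Auto") -> dict:
    """Build keyword arguments for an embeddings request (illustrative only)."""
    native = MODEL_DEFAULT_DIMS[model]
    request = {"model": model, "input": list(texts)}
    if dimensions != "Auto":
        dims = int(dimensions)
        # Assumption: sizes above the model's default are rejected.
        if not 1 <= dims <= native:
            raise ValueError(f"{model} supports at most {native} dimensions, got {dims}")
        request["dimensions"] = dims
    return request


request = embedding_request(["growth mindset", "deliberate practice"], dimensions=512)
print(request["model"], request["dimensions"])  # text-embedding-3-small 512

# With the official openai package, this could be sent as:
#   client.embeddings.create(**request)
```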

Steps

Select model and dimensions
Click “Generate Embeddings”
A progress bar is displayed (seconds to minutes depending on data volume)
Upon completion, proceed to the next step

Additional Features

Embedding CSV Upload:

Upload pre-computed embedding vectors in CSV format
Each row corresponds to one data item, and each column holds the value of one dimension
Useful when using custom embedding models
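
Before uploading, it can help to check that a pre-computed file has the expected shape. The sketch below is an assumption-based check, not the engine's own validation: it assumes the file has no header row and that every row must have the same number of numeric columns. The read_embedding_csv function is hypothetical.

```python
import csv
import io


def read_embedding_csv(stream) -> list:
    """Parse rows of floats and verify that all rows have the same width."""
    vectors = [[float(value) for value in row] for row in csv.reader(stream) if row]
    if not vectors:
        raise ValueError("The file contains no vectors")
    width = len(vectors[0])
    for index, vector in enumerate(vectors):
        if len(vector) != width:
            raise ValueError(f"Row {index} has {len(vector)} values, expected {width}")
    return vectors


# Two toy 3-dimensional vectors, one per data item.
sample_csv = io.StringIO("0.12,-0.03,0.5\n0.08,0.11,-0.4\n")
vectors = read_embedding_csv(sample_csv)
print(len(vectors), len(vectors[0]))  # 2 3
```

The number of rows should match the number of rows loaded in Step 1, since each row stands for one data item.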

Embedding CSV Download:

Export generated embedding vectors in CSV format
Useful for external tool analysis or later reuse

Credit Cost: 10 credits for embedding generation (5 credits with your own API key)

10.4 Step 3: Dimension Reduction

Compresses high-dimensional embedding vectors (e.g., 1536 dimensions) to an analyzable low-dimensional space (typically 3-8 dimensions). Each dimension can be interpreted as a conceptual “analysis axis.”

Settings

Method:

Method	Description	Recommended Use
UMAP	Non-linear dimension reduction preserving local structure. Recommended	Most cases
PCA	Linear transformation maximizing variance. Fast	Rough overview of data structure
PCA+UMAP	PCA to intermediate dimensions, then UMAP to final dimensions	Stabilizing high-dimensional data

Dimensions:

Specify output dimensions
Default: 3-8 based on data volume (fewer for small, more for large datasets)
Recommended: Under 30 concepts → 3 dimensions, 30-100 → 5, over 100 → 6-8
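
The recommendation above can be expressed as a small lookup. This is only a restatement of the guideline as code. The recommended_dimensions function is hypothetical, and the tool's own default may be computed differently.

```python
def recommended_dimensions(n_concepts: int) -> tuple:
    """Return the recommended (minimum, maximum) number of output dimensions."""
    if n_concepts < 30:
        return (3, 3)
    if n_concepts <= 100:
        return (5, 5)
    return (6, 8)


for count in (12, 30, 100, 250):
    print(count, recommended_dimensions(count))
# 12 (3, 3)
# 30 (5, 5)
# 100 (5, 5)
# 250 (6, 8)
```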

UMAP Parameters:

Parameter	Default	Range	Description
n_neighbors	15	2-100	Number of neighbors. Small → emphasizes local structure (fine clusters). Large → emphasizes global structure (broad trends)
min_dist	0.1	0.0-1.0	Minimum distance between points in the reduced space. Small → dense clusters. Large → more evenly spread points
metric	cosine	cosine / euclidean / manhattan	Distance metric. Cosine recommended for text embeddings

Parameter Tuning Guide:

For clearly separated clusters: n_neighbors small (5-10), min_dist small (0.01-0.05)
For overall trends: n_neighbors large (30-50), min_dist large (0.3-0.5)
Default values work well for most cases
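
The tuning guide can be summarized as two presets layered on top of the defaults. The sketch below is illustrative: the preset names and the umap_params function are hypothetical, and the preset values are simply picked from inside the ranges given above.

```python
DEFAULTS = {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}
METRICS = {"cosine", "euclidean", "manhattan"}
PRESETS = {
    "separated_clusters": {"n_neighbors": 8, "min_dist": 0.03},
    "overall_trends": {"n_neighbors": 40, "min_dist": 0.4},
}


def umap_params(preset=None, **overrides) -> dict:
    """Merge defaults, an optional preset, and overrides, then validate."""
    if preset is not None and preset not in PRESETS:
        raise ValueError(f"Unknown preset: {preset}")
    params = {**DEFAULTS, **PRESETS.get(preset, {}), **overrides}
    if not 2 <= params["n_neighbors"] <= 100:
        raise ValueError("n_neighbors must be between 2 and 100")
    if not 0.0 <= params["min_dist"] <= 1.0:
        raise ValueError("min_dist must be between 0.0 and 1.0")
    if params["metric"] not in METRICS:
        raise ValueError(f"metric must be one of {sorted(METRICS)}")
    return params


print(umap_params("overall_trends"))
# {'n_neighbors': 40, 'min_dist': 0.4, 'metric': 'cosine'}

# With the umap-learn package, these could be passed on as:
#   umap.UMAP(n_components=5, **umap_params("overall_trends"))
```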

Steps

Set method, dimensions, and parameters
Click “Run”
Upon completion, statistical information for each dimension is displayed

Dimension Interpretation (Labeling)

Labeling is an important step that gives meaning to the dimension reduction results.

Click “Interpret Dimensions”
AI analyzes each dimension’s data and suggests meaning labels
For each dimension, either click one of the AI-suggested label buttons to select it, or enter a custom label in the text input field

Label Examples:

“Dim 1” → “Concrete ↔ Abstract”
“Dim 2” → “Theory ↔ Practice”
“Dim 3” → “Individual ↔ Organization”
“Dim 4” → “Short-term ↔ Long-term”

Dimension labels are used during model exploration to understand the “meaning” of each node and cluster’s position.

10.5 Troubleshooting

Issue	Cause and Solution
“No text column selected” error	Set at least one column to “Text” role in Step 1
Embedding generation fails	Check that your OpenAI API key is valid and that your credit balance is sufficient
Dimension reduction results look random	UMAP is a stochastic algorithm, so results differ between runs. This is normal. If unsatisfied, change parameters and re-run
“Engine session expired”	Session times out after prolonged inactivity. Restart from “Load Data”
Dimension label suggestions are off-target	AI suggestions are reference only. Manually enter labels based on your understanding of the data
