Chapter 10: ConceptMap-Text — Data Input and Preprocessing

Chapter 10: ConceptMap-Text — Data Input and Preprocessing — ThinkNavi User Manual

ConceptMap-Text is a 7-step wizard for building concept structure models (GNG-MST) from text data. Access it by clicking “Model & Explore” in the sidebar.

10.1 Screen Layout

The ConceptMap-Text screen has the following layout:

Left Sidebar (Step Navigation):

  • 7 step buttons arranged vertically
  • Current step is highlighted
  • Incomplete steps are grayed out and non-clickable
  • Completed steps show a green checkmark
  • “Save/Load” button at the bottom

Main Area:

  • Control panel corresponding to the selected step

Step Order:

  1. Data Input → 2. Embedding → 3. Dimension Reduction → 4. Feature Settings → 5. Model Build → 6. Clustering → 7. Exploration

Each step requires the previous step to be completed. However, you can go back to a completed step and re-run it (subsequent steps will be reset).
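
The gating rule above can be pictured with a small sketch. The `WizardState` class and its method names are hypothetical, invented here for illustration; they are not ThinkNavi's internal code, only a model of the behavior described.

```python
STEPS = [
    "Data Input", "Embedding", "Dimension Reduction", "Feature Settings",
    "Model Build", "Clustering", "Exploration",
]


class WizardState:
    """Hypothetical model of the step-gating rule (illustration only)."""

    def __init__(self):
        self.completed = set()  # zero-based indices of completed steps

    def is_unlocked(self, step):
        # The first step is always available; later steps need the
        # immediately preceding step to be complete.
        return step == 0 or (step - 1) in self.completed

    def run(self, step):
        if not self.is_unlocked(step):
            raise ValueError(
                f"{STEPS[step]} is locked: complete {STEPS[step - 1]} first"
            )
        # Re-running a step resets every step after it.
        self.completed = {s for s in self.completed if s < step}
        self.completed.add(step)


state = WizardState()
for step in (0, 1, 2):      # Data Input, Embedding, Dimension Reduction
    state.run(step)
state.run(0)                # re-run Data Input
print(sorted(state.completed))
```

After the re-run, only Data Input remains complete, so Embedding is unlocked again but Dimension Reduction is not.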

10.2 Step 1: Data Input

Specify data sources and text columns.

Project and File Selection

  1. Select a project from the “Project” dropdown (created in AutoResearch)
  2. Select a CSV file from the “File” dropdown
  3. A preview of the first few rows and the list of columns are displayed

Column Role Assignment

Assign roles to each CSV column by clicking buttons:

Role        | Button Color | Description                     | Purpose
Text        | Blue         | Used as text data for analysis  | Target for embedding → model building
Numeric     | Green        | Numeric data                    | Metadata for profile analysis
Categorical | Purple       | Categorical data                | Filtering for cluster analysis
Reference   | Yellow       | Display only as reference       | Shown in node details during exploration
Unused      | Gray         | Not used in analysis            | Exclude unnecessary columns

Important: At least one column must be set to “Text”. If no Text column is selected, you cannot proceed to the next step.

Recommended Settings Example (for concept extraction CSV):

  • concept_name_en → Text (main modeling target)
  • summary_en → Text (include summaries in modeling)
  • trigger_context_en → Text or Reference
  • abstract_structure_en → Text or Reference
  • chapter_section → Categorical
  • chunk_index → Numeric or Unused
  • book, author → Reference
  • key_terms → Reference or Unused
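
As a rough illustration, the recommended assignment can be written as a column-to-role mapping and checked against the "at least one Text column" rule. The `validate_roles` helper is hypothetical; ThinkNavi performs this check in its own UI, and where the manual offers two options (for example "Text or Reference") the sketch simply picks one.

```python
VALID_ROLES = {"Text", "Numeric", "Categorical", "Reference", "Unused"}


def validate_roles(roles):
    """Return the Text columns, or raise if the assignment is unusable.

    Hypothetical helper mirroring the rule in this section.
    """
    unknown = {col: role for col, role in roles.items() if role not in VALID_ROLES}
    if unknown:
        raise ValueError(f"Unknown roles: {unknown}")
    text_columns = [col for col, role in roles.items() if role == "Text"]
    if not text_columns:
        raise ValueError("No text column selected")
    return text_columns


roles = {
    "concept_name_en": "Text",
    "summary_en": "Text",
    "trigger_context_en": "Reference",     # could also be Text
    "abstract_structure_en": "Reference",  # could also be Text
    "chapter_section": "Categorical",
    "chunk_index": "Numeric",              # could also be Unused
    "book": "Reference",
    "author": "Reference",
    "key_terms": "Unused",                 # could also be Reference
}

print(validate_roles(roles))  # ['concept_name_en', 'summary_en']
```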

CCM (Connected Concept Model) Building:

Setting multiple Text columns creates independent models for each column, cross-referenced by row index to create a CCM. For example, setting both concept_name_en and summary_en to Text builds a concept name model and a summary model linked together.
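
One way to picture "cross-referenced by row index" is the sketch below. It assumes each model assigns every CSV row to one node; the node names, the sample assignments, and the `cross_reference` function are all made up for illustration, and the manual does not say how ThinkNavi stores these links internally.

```python
from collections import defaultdict

# Hypothetical node assignments from two independently built models.
# Position i in both lists refers to the same CSV row.
name_model_nodes = ["N1", "N1", "N2", "N3"]
summary_model_nodes = ["S1", "S2", "S2", "S2"]


def cross_reference(nodes_a, nodes_b):
    """Map each node of model A to the model B nodes that share a row with it."""
    if len(nodes_a) != len(nodes_b):
        raise ValueError("Both models must cover the same rows")
    links = defaultdict(set)
    for node_a, node_b in zip(nodes_a, nodes_b):
        links[node_a].add(node_b)
    return {node: sorted(targets) for node, targets in links.items()}


links = cross_reference(name_model_nodes, summary_model_nodes)
print(links)  # {'N1': ['S1', 'S2'], 'N2': ['S2'], 'N3': ['S2']}
```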

Loading Data

  1. Once all column roles are set, click “Load Data”
  2. CSV data is sent to the engine
  3. On success, a preview table of the first 20 rows is displayed
  4. This step is marked “Complete” and the next step (Embedding) unlocks

Note: Re-running “Load Data” resets the engine session and all subsequent steps (embedding, dimension reduction, model, etc.).

10.3 Step 2: Embedding

Converts text data into high-dimensional vector space. Semantically similar texts become similar vectors, forming the foundation for subsequent analysis.
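
To make "similar texts become similar vectors" concrete, here is a minimal sketch using cosine similarity on made-up 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the values below are invented for illustration rather than produced by any embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of three short texts.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(round(cosine_similarity(cat, kitten), 2))   # close to 1: related texts
print(round(cosine_similarity(cat, invoice), 2))  # close to 0: unrelated texts
```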

Settings

Model Selection:

Model                  | Dimensions | Features                                 | Recommended Use
text-embedding-3-small | 1536       | Fast, low cost. Sufficient accuracy      | Normal use (recommended)
text-embedding-3-large | 3072       | High accuracy. More processing time/cost | When high accuracy is needed

Dimension Selection:

Option | Description
Auto   | Uses the model’s default dimensions (recommended)
256    | Low dimensional. Fast but may lose accuracy
512    | Balanced
1024   | Medium-high accuracy
1536   | text-embedding-3-small default
3072   | text-embedding-3-large default

Recommendation: text-embedding-3-small + Auto is typically sufficient.

Steps

  1. Select model and dimensions
  2. Click “Generate Embeddings”
  3. A progress bar is displayed (seconds to minutes depending on data volume)
  4. Upon completion, proceed to the next step

Additional Features

Embedding CSV Upload:

  • Upload pre-computed embedding vectors in CSV format
  • Each row corresponds to one data item, columns are dimension values
  • Useful when using custom embedding models

Embedding CSV Download:

  • Export generated embedding vectors in CSV format
  • Useful for external tool analysis or later reuse
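
The sketch below shows one plausible shape for such a file: one row per data item, one column per dimension. It assumes a comma-separated file with no header row, which the manual does not specify, so compare against a file downloaded from ThinkNavi before relying on it. The function names are hypothetical.

```python
import csv
import io


def write_embeddings(vectors, fh):
    """Write one vector per row, one dimension per column (no header assumed)."""
    writer = csv.writer(fh)
    for vector in vectors:
        writer.writerow(vector)


def read_embeddings(fh):
    """Read vectors back and check that every row has the same dimension."""
    rows = [[float(value) for value in row] for row in csv.reader(fh) if row]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"Inconsistent dimensions: {sorted(widths)}")
    return rows


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_embeddings([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], buffer)
buffer.seek(0)
vectors = read_embeddings(buffer)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Checking that all rows share the same width before uploading catches the most common formatting mistake in hand-built embedding files.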

Credit Cost: 10 credits for embedding generation (5 credits with your own API key)

10.4 Step 3: Dimension Reduction

Compresses high-dimensional embedding vectors (e.g., 1536 dimensions) to an analyzable low-dimensional space (typically 3-8 dimensions). Each dimension can be interpreted as a conceptual “analysis axis.”

Settings

Method:

Method   | Description                                                             | Recommended Use
UMAP     | Non-linear dimension reduction preserving local structure. Recommended | Most cases
PCA      | Linear transformation maximizing variance. Fast                        | Rough overview of data structure
PCA+UMAP | PCA to intermediate dimensions, then UMAP to final dimensions          | Stabilizing high-dimensional data

Dimensions:

  • Specify output dimensions
  • Default: 3-8 based on data volume (fewer for small, more for large datasets)
  • Recommended: Under 30 concepts → 3 dimensions, 30-100 → 5, over 100 → 6-8
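
The recommendation above amounts to a simple lookup. The helper below is a hypothetical restatement of that rule and returns a (minimum, maximum) pair, since the manual gives a range for large datasets.

```python
def recommended_dimensions(n_concepts):
    """Return the (min, max) recommended output dimensions for a concept count."""
    if n_concepts < 30:
        return (3, 3)
    if n_concepts <= 100:
        return (5, 5)
    return (6, 8)


for count in (12, 30, 100, 250):
    print(count, recommended_dimensions(count))
```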

UMAP Parameters:

Parameter   | Default | Range                          | Description
n_neighbors | 15      | 2-100                          | Number of neighbors. Small → emphasizes local structure (fine clusters). Large → emphasizes global structure (broad trends)
min_dist    | 0.1     | 0.0-1.0                        | Minimum distance. Small → dense clusters. Large → uniform distribution
metric      | cosine  | cosine / euclidean / manhattan | Distance metric. Cosine recommended for text embeddings

Parameter Tuning Guide:

  • For clearly separated clusters: n_neighbors small (5-10), min_dist small (0.01-0.05)
  • For overall trends: n_neighbors large (30-50), min_dist large (0.3-0.5)
  • Default values work well for most cases
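
The defaults, ranges, and tuning guide can be summarized as a small parameter builder. This is a sketch: the preset names and the `umap_params` function are hypothetical, and the preset values are single picks from the ranges suggested above. The keys match the parameter names shown in the ThinkNavi UI, which are also the keyword names used by the open-source umap-learn library, although the manual does not state which implementation ThinkNavi uses.

```python
UMAP_DEFAULTS = {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}
ALLOWED_METRICS = {"cosine", "euclidean", "manhattan"}

# Hypothetical presets: one value chosen from each range in the tuning guide.
PRESETS = {
    "separated_clusters": {"n_neighbors": 10, "min_dist": 0.05},
    "overall_trends": {"n_neighbors": 30, "min_dist": 0.3},
}


def umap_params(preset=None, **overrides):
    """Start from the defaults, apply a preset and overrides, then validate."""
    params = dict(UMAP_DEFAULTS)
    if preset is not None:
        params.update(PRESETS[preset])
    params.update(overrides)
    if not 2 <= params["n_neighbors"] <= 100:
        raise ValueError("n_neighbors must be between 2 and 100")
    if not 0.0 <= params["min_dist"] <= 1.0:
        raise ValueError("min_dist must be between 0.0 and 1.0")
    if params["metric"] not in ALLOWED_METRICS:
        raise ValueError(f"metric must be one of {sorted(ALLOWED_METRICS)}")
    return params


print(umap_params())
print(umap_params("separated_clusters"))
print(umap_params("overall_trends", metric="euclidean"))
```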

Steps

  1. Set method, dimensions, and parameters
  2. Click “Run”
  3. Upon completion, statistical information for each dimension is displayed

Dimension Interpretation (Labeling)

This step gives meaning to the dimension reduction results.

  1. Click “Interpret Dimensions”
  2. AI analyzes each dimension’s data and suggests meaning labels
  3. For each dimension, choose a label in one of two ways:
     • Click one of the AI-suggested labels, which appear as buttons
     • Manually enter a custom label in the text input field

Label Examples:

  • “Dim 1” → “Concrete ↔ Abstract”
  • “Dim 2” → “Theory ↔ Practice”
  • “Dim 3” → “Individual ↔ Organization”
  • “Dim 4” → “Short-term ↔ Long-term”

Dimension labels are used during model exploration to interpret the position of each node and cluster along each axis.

10.5 Troubleshooting

Issue | Cause and Solution
“No text column selected” error | Set at least one column to “Text” role in Step 1
Embedding generation fails | Check if OpenAI API key is valid. Also check credit balance
Dimension reduction results look random | UMAP is a stochastic algorithm, so results differ between runs. This is normal. If unsatisfied, change parameters and re-run
“Engine session expired” | Session times out after prolonged inactivity. Restart from “Load Data”
Dimension label suggestions are off-target | AI suggestions are reference only. Manually enter labels based on your understanding of the data