Chapter 10: ConceptMap-Text — Data Input and Preprocessing
ConceptMap-Text is a 7-step wizard for building concept structure models (GNG-MST) from text data. Access it by clicking “Model & Explore” in the sidebar.
10.1 Screen Layout
The ConceptMap-Text screen has the following layout:
Left Sidebar (Step Navigation):
- 7 step buttons arranged vertically
- Current step is highlighted
- Incomplete steps are grayed out and non-clickable
- Completed steps show a green checkmark
- “Save/Load” button at the bottom
Main Area:
- Control panel corresponding to the selected step
Step Order:
1. Data Input → 2. Embedding → 3. Dimension Reduction → 4. Feature Settings → 5. Model Build → 6. Clustering → 7. Exploration
Each step requires the previous step to be completed. However, you can go back to a completed step and re-run it (subsequent steps will be reset).
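The gating rule above can be pictured with a small state model. This is only a sketch of the described behavior, not the application's actual code; the `Wizard` class and its method names are hypothetical.

```python
STEPS = [
    "Data Input", "Embedding", "Dimension Reduction", "Feature Settings",
    "Model Build", "Clustering", "Exploration",
]


class Wizard:
    """Hypothetical model of the step gating described above."""

    def __init__(self):
        # Number of consecutive steps completed so far (0 = nothing done yet).
        self.completed = 0

    def can_open(self, step: int) -> bool:
        # Steps are 1-based. A step is clickable if it is already complete
        # or if it is the first incomplete step.
        return 1 <= step <= min(self.completed + 1, len(STEPS))

    def run(self, step: int) -> None:
        if not self.can_open(step):
            raise ValueError(f"Step {step} is locked or does not exist")
        # Running (or re-running) a step completes it and resets later steps.
        self.completed = step


wizard = Wizard()
for step in (1, 2, 3):
    wizard.run(step)
wizard.run(2)  # re-run Embedding: Dimension Reduction must be run again
print(STEPS[wizard.completed - 1], wizard.can_open(4))  # Embedding False
```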
10.2 Step 1: Data Input
Specify data sources and text columns.
Project and File Selection
- Select a project from the “Project” dropdown (created in AutoResearch)
- Select a CSV file from the “File” dropdown
- A preview of the first few rows and the column list are displayed
Column Role Assignment
Assign a role to each CSV column by clicking the corresponding button:
| Role | Button Color | Description | Purpose |
|---|---|---|---|
| Text | Blue | Used as text data for analysis | Target for embedding → model building |
| Numeric | Green | Numeric data | Metadata for profile analysis |
| Categorical | Purple | Categorical data | Filtering for cluster analysis |
| Reference | Yellow | Display only as reference | Shown in node details during exploration |
| Unused | Gray | Not used in analysis | Exclude unnecessary columns |
Important: At least one column must be set to “Text”. If no Text column is selected, you cannot proceed to the next step.
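As a rough illustration of this rule, the check below validates a role assignment before loading. The role names come from the table above; the `validate_roles` function and the dictionary format are assumptions made for this sketch, not part of the product.

```python
VALID_ROLES = {"Text", "Numeric", "Categorical", "Reference", "Unused"}


def validate_roles(roles: dict[str, str]) -> list[str]:
    """Return the Text columns, or raise if the assignment cannot be used."""
    unknown = {col: role for col, role in roles.items() if role not in VALID_ROLES}
    if unknown:
        raise ValueError(f"Unknown roles: {unknown}")
    text_columns = [col for col, role in roles.items() if role == "Text"]
    if not text_columns:
        # Mirrors the rule above: at least one Text column is required.
        raise ValueError("No text column selected")
    return text_columns


roles = {
    "concept_name_en": "Text",
    "chapter_section": "Categorical",
    "chunk_index": "Unused",
}
print(validate_roles(roles))  # ['concept_name_en']
```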
Recommended Settings Example (for concept extraction CSV):
- concept_name_en → Text (main modeling target)
- summary_en → Text (include summaries in modeling)
- trigger_context_en → Text or Reference
- abstract_structure_en → Text or Reference
- chapter_section → Categorical
- chunk_index → Numeric or Unused
- book, author → Reference
- key_terms → Reference or Unused
CCM (Connected Concept Model) Building:
Setting multiple Text columns builds an independent model for each column; the models are cross-referenced by row index to form a CCM. For example, setting both concept_name_en and summary_en to Text builds a concept name model and a summary model that are linked together.
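One way to picture cross-referencing by row index: if each model assigns every CSV row to one of its nodes, rows shared between a node in one model and nodes in the other define the links. The sketch below assumes that representation. The engine's internal linking is not documented here, and `linked_nodes` and the sample assignments are hypothetical.

```python
from collections import Counter


def linked_nodes(node_of_row_a, node_of_row_b, node_a):
    """Count which nodes of model B share rows with `node_a` of model A."""
    return Counter(
        node_of_row_b[row]
        for row, node in enumerate(node_of_row_a)
        if node == node_a
    )


# Hypothetical assignments: list index = CSV row, value = node id in that model.
name_model = [0, 0, 1, 2, 1]
summary_model = [5, 6, 6, 7, 6]

# Rows 2 and 4 belong to node 1 of the name model; both map to summary node 6.
print(linked_nodes(name_model, summary_model, node_a=1))  # Counter({6: 2})
```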
Loading Data
- Once all column roles are set, click “Load Data”
- CSV data is sent to the engine
- On success, a preview table of the first 20 rows is displayed
- This step is marked “Complete” and the next step (Embedding) unlocks
Note: Re-running “Load Data” resets the engine session and all subsequent steps (embedding, dimension reduction, model, etc.).
10.3 Step 2: Embedding
Converts text data into vectors in a high-dimensional space. Semantically similar texts map to nearby vectors, which forms the foundation for subsequent analysis.
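To make "similar vectors" concrete, the sketch below computes cosine similarity, a common measure for comparing text embeddings. The three-dimensional vectors are made-up toy values, not real model output, and the guide does not state which measure the engine uses at this step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (illustrative values only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```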
Settings
Model Selection:
| Model | Dimensions | Features | Recommended Use |
|---|---|---|---|
| text-embedding-3-small | 1536 | Fast, low cost. Sufficient accuracy | Normal use (recommended) |
| text-embedding-3-large | 3072 | High accuracy. More processing time/cost | When high accuracy is needed |
Dimension Selection:
| Option | Description |
|---|---|
| Auto | Uses the model’s default dimensions (recommended) |
| 256 | Low dimensional. Fast but may lose accuracy |
| 512 | Balanced |
| 1024 | Medium-high accuracy |
| 1536 | text-embedding-3-small default |
| 3072 | text-embedding-3-large default |
Recommendation: text-embedding-3-small + Auto is typically sufficient.
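The relationship between the model and dimension options can be sketched as follows. The helper `build_embedding_request` is hypothetical. It only assembles a parameter dictionary using the `model`, `input`, and `dimensions` names from OpenAI's embeddings endpoint and makes no network call. How the application itself forwards these settings is an assumption.

```python
MODEL_DEFAULT_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def build_embedding_request(texts, model="text-embedding-3-small", option="Auto"):
    """Build request parameters for an embeddings call (hypothetical helper)."""
    default = MODEL_DEFAULT_DIMS[model]
    request = {"model": model, "input": list(texts)}
    if option != "Auto":
        dims = int(option)
        if not 1 <= dims <= default:
            raise ValueError(f"{model} supports at most {default} dimensions")
        # Only send `dimensions` when the output is actually shortened.
        if dims != default:
            request["dimensions"] = dims
    return request


print(build_embedding_request(["first text", "second text"], option="512"))
```

With "Auto", no `dimensions` value is set and the model's default size applies (1536 or 3072).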
Steps
- Select model and dimensions
- Click “Generate Embeddings”
- A progress bar is displayed (seconds to minutes depending on data volume)
- Upon completion, proceed to the next step
Additional Features
Embedding CSV Upload:
- Upload pre-computed embedding vectors in CSV format
- Each row corresponds to one data item; each column holds one dimension value
- Useful when using custom embedding models
Embedding CSV Download:
- Export generated embedding vectors in CSV format
- Useful for external tool analysis or later reuse
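A minimal sketch of the row-per-item layout, using Python's standard `csv` module. It assumes a plain numeric CSV with no header row and no ID column. The exact file layout the application expects or exports is not specified above, so check a downloaded file before relying on this.

```python
import csv
import io


def write_embeddings_csv(vectors, fileobj):
    """Write one embedding per row, one dimension value per column."""
    writer = csv.writer(fileobj)
    for vector in vectors:
        writer.writerow(vector)


def read_embeddings_csv(fileobj):
    """Read embeddings back and check that every row has the same length."""
    rows = [[float(value) for value in row] for row in csv.reader(fileobj) if row]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("All rows must have the same number of dimensions")
    return rows


# Round trip through an in-memory buffer (a real file opened with
# newline="" works the same way).
buffer = io.StringIO()
write_embeddings_csv([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]], buffer)
buffer.seek(0)
print(read_embeddings_csv(buffer))
```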
Credit Cost: 10 credits for embedding generation (5 credits with your own API key)
10.4 Step 3: Dimension Reduction
Compresses high-dimensional embedding vectors (e.g., 1536 dimensions) to an analyzable low-dimensional space (typically 3-8 dimensions). Each dimension can be interpreted as a conceptual “analysis axis.”
Settings
Method:
| Method | Description | Recommended Use |
|---|---|---|
| UMAP | Non-linear dimension reduction preserving local structure. Recommended | Most cases |
| PCA | Linear transformation maximizing variance. Fast | Rough overview of data structure |
| PCA+UMAP | PCA to intermediate dimensions, then UMAP to final dimensions | Stabilizing high-dimensional data |
Dimensions:
- Specify output dimensions
- Default: 3-8 based on data volume (fewer for small, more for large datasets)
- Recommended: Under 30 concepts → 3 dimensions, 30-100 → 5, over 100 → 6-8
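The recommendation above can be written as a simple lookup. The function name is hypothetical, and the application's own default logic may differ in its details.

```python
def recommended_dimensions(n_concepts: int) -> tuple[int, int]:
    """Return the (min, max) recommended number of output dimensions."""
    if n_concepts < 30:
        return (3, 3)
    if n_concepts <= 100:
        return (5, 5)
    return (6, 8)


for n in (20, 60, 250):
    print(n, recommended_dimensions(n))
```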
UMAP Parameters:
| Parameter | Default | Range | Description |
|---|---|---|---|
| n_neighbors | 15 | 2-100 | Number of neighbors. Small → emphasizes local structure (fine clusters). Large → emphasizes global structure (broad trends) |
| min_dist | 0.1 | 0.0-1.0 | Minimum distance. Small → dense clusters. Large → uniform distribution |
| metric | cosine | cosine / euclidean / manhattan | Distance metric. Cosine recommended for text embeddings |
Parameter Tuning Guide:
- For clearly separated clusters: n_neighbors small (5-10), min_dist small (0.01-0.05)
- For overall trends: n_neighbors large (30-50), min_dist large (0.3-0.5)
- Default values work well for most cases
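The defaults, ranges, and tuning guide above can be captured as parameter presets with a range check. The preset names and the specific values picked from each suggested range are illustrative choices, and `umap_params` is a hypothetical helper. If you reproduce the reduction outside the application with the umap-learn library, the resulting dictionary could be passed as `umap.UMAP(n_components=5, **params)`. Whether the engine itself uses umap-learn is an assumption.

```python
DEFAULTS = {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}
METRICS = {"cosine", "euclidean", "manhattan"}

# Illustrative picks from the ranges in the tuning guide.
PRESETS = {
    "separated_clusters": {"n_neighbors": 10, "min_dist": 0.05},
    "overall_trends": {"n_neighbors": 30, "min_dist": 0.3},
}


def umap_params(preset=None, **overrides):
    """Merge defaults, an optional preset, and overrides, then validate."""
    if preset is not None and preset not in PRESETS:
        raise ValueError(f"Unknown preset: {preset}")
    params = {**DEFAULTS, **PRESETS.get(preset, {}), **overrides}
    if not 2 <= params["n_neighbors"] <= 100:
        raise ValueError("n_neighbors must be between 2 and 100")
    if not 0.0 <= params["min_dist"] <= 1.0:
        raise ValueError("min_dist must be between 0.0 and 1.0")
    if params["metric"] not in METRICS:
        raise ValueError(f"metric must be one of {sorted(METRICS)}")
    return params


print(umap_params("separated_clusters"))
print(umap_params(metric="euclidean"))
```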
Steps
- Set method, dimensions, and parameters
- Click “Run”
- Upon completion, statistical information for each dimension is displayed
Dimension Interpretation (Labeling)
An important step for giving meaning to dimension reduction results.
- Click “Interpret Dimensions”
- AI analyzes each dimension’s data and suggests meaning labels
- For each dimension, either click one of the AI-suggested label buttons to select it, or enter a custom label in the text input field
Label Examples:
- “Dim 1” → “Concrete ↔ Abstract”
- “Dim 2” → “Theory ↔ Practice”
- “Dim 3” → “Individual ↔ Organization”
- “Dim 4” → “Short-term ↔ Long-term”
Dimension labels are used during model exploration to understand the “meaning” of each node’s and cluster’s position.
10.5 Troubleshooting
| Issue | Cause and Solution |
|---|---|
| “No text column selected” error | Set at least one column to “Text” role in Step 1 |
| Embedding generation fails | Check that your OpenAI API key is valid and that your credit balance is sufficient |
| Dimension reduction results look random | UMAP is a stochastic algorithm, so results differ between runs. This is normal. If unsatisfied, change parameters and re-run |
| “Engine session expired” | Session times out after prolonged inactivity. Restart from “Load Data” |
| Dimension label suggestions are off-target | AI suggestions are reference only. Manually enter labels based on your understanding of the data |