Documentation
Complete Guide to Invantia Platform
📚 Documentation Sections
🚀 Getting Started
What is Invantia?
Invantia is a document intelligence platform that helps you analyze large sets of documents using AI. Unlike traditional search tools, Invantia uses intelligent corpus reduction to find precisely the content you need and package it for AI analysis.
Think of it as a smart document librarian that reads your documents, understands what you're looking for, and prepares perfectly organized summaries for AI assistants like ChatGPT, Claude, or Gemini.
5-Minute Quick Start
Step 1: Upload Documents
Go to the Desktop home page and upload PDF, DOCX, or TXT files. Files are processed entirely in your browser - nothing is uploaded to our servers.
Step 2: Select Documents or Collection
Choose which documents to search, or select a collection if you've organized related documents together.
Step 3: Choose Chat Window Size
Select "Standard" (works with all free AI accounts) or "Large" (for paid subscriptions with bigger paste limits).
Step 4: Ask Your Questions
Type natural language questions like "What are the key findings about climate change?" Invantia uses semantic search to find related content even if exact words don't match.
Step 5: Copy & Paste to AI
Click the "Copy Package" button and paste into ChatGPT, Claude, or any other AI assistant. The AI will analyze the pre-filtered, relevant content.
First-Time User Checklist
- Upload 1-3 test documents to get familiar with the interface
- Try a simple search to see how results are organized
- Create your first collection to group related documents
- Experiment with different question phrasings to see semantic matching
- Review the security page to understand data privacy
- Bookmark the Manage Documents page for easy access
🖥️ Desktop Edition User Guide
Document Management
Uploading Documents
Supported formats: PDF, DOCX, TXT
Upload methods:
- Click "Choose Files" button
- Drag and drop files onto the drop zone
- Select multiple files at once (Ctrl+Click or Cmd+Click)
Processing: Files are parsed in your browser and stored in IndexedDB. Large files (>10MB) may take a minute to process.
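Under the hood, each parsed file becomes a record in the documents store. A minimal sketch of that step, assuming the database has already been opened (the record fields mirror the IndexedDB schema described under Technical Details; the helper name is illustrative):

// Minimal sketch: persist a parsed document into the "documents" store
function saveDocument(db, filename, fileType, rawText, documentType) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction("documents", "readwrite");
    const record = {
      filename,
      fileType,                               // "pdf" | "docx" | "txt"
      uploadDate: new Date().toISOString(),
      rawText,                                // full extracted text
      documentType                            // "Contract", "Report", ...
    };
    const request = tx.objectStore("documents").add(record);
    request.onsuccess = () => resolve(request.result); // auto-generated id
    request.onerror = () => reject(request.error);
  });
}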
Organizing with Collections
Collections let you group related documents:
- Go to Manage Documents
- Click "Create Collection"
- Give it a descriptive name (e.g., "Q3 2024 Financial Reports")
- Add documents to the collection
- Use collections in searches to query multiple related documents at once
Document Types
Assign types to documents for better organization:
- Contract: Legal agreements, terms of service
- Report: Research reports, financial statements
- Email: Email threads, correspondence
- Memo: Internal memos, notes
- Other: Anything that doesn't fit the above categories
Building Queries
Understanding Query Topics
Each "topic" represents a question or area of interest. Invantia searches all documents for content related to each topic and packages the results together.
Writing Effective Questions
Good questions:
- "What are the financial projections for 2025?"
- "Summarize the key risks mentioned in the contracts"
- "Find all references to data privacy requirements"
- "What are the deliverables and deadlines?"
Less effective:
- "Tell me about stuff" (too vague)
- "revenue" (better: "What were the revenue figures?")
- "Find page 17" (Invantia searches by content, not page numbers)
Semantic Search
Invantia uses semantic expansion to find related content even when exact words don't match:
- Query: "financial performance" → Also finds: "revenue", "profit", "earnings"
- Query: "deadlines" → Also finds: "due dates", "milestones", "completion dates"
- Query: "risks" → Also finds: "concerns", "issues", "challenges", "threats"
Chat Packages
What is a Chat Package?
A chat package is a formatted text bundle containing:
- Instructions for the AI on how to analyze the content
- Your original questions
- Relevant excerpts from your documents (super-chunks)
- Document metadata (titles, sources)
Using Chat Packages
- Click "Copy Package" button after search completes
- Open your preferred AI assistant (ChatGPT, Claude, Gemini, etc.)
- Paste the entire package into the chat
- Wait for the AI to analyze and respond
- Ask follow-up questions to dig deeper
Account Tier Selection
This setting optimizes package size for your AI provider's paste limits:
- Standard (30k characters): Works with free ChatGPT, Claude, Gemini accounts
- Large (100k characters): For paid subscriptions (ChatGPT Plus, Claude Pro, Gemini Advanced)
Note: These limits are imposed by AI providers, not Invantia. Invantia Desktop is always free.
Data Backup & Restore
Creating Backups
- Go to Desktop home
- Look for "Your Library" sidebar
- Click "Backup Data" button
- A JSON file downloads with all your documents and metadata (a rough sketch of its shape follows this list)
- Store this file securely (it contains your document content)
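The exact export format is internal to Invantia, but conceptually the backup bundles the same object stores described under Technical Details. A hypothetical, abbreviated example of the exported JSON:

{
  "exportedAt": "2025-12-06T23:47:26.229Z",
  "documents": [
    { "id": 5, "filename": "gps-manual.pdf", "fileType": "pdf",
      "documentType": "Report", "rawText": "To configure GPS, ..." }
  ],
  "collections": [
    { "id": 1, "name": "Q3 2024 Financial Reports", "documentIds": [5] }
  ],
  "vectors": [ /* per-document co-occurrence data, see Technical Details */ ],
  "metadata": [ /* app-level settings */ ]
}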
Restoring from Backup
- Go to Manage Documents
- Click "Import Backup" button
- Select your previously exported JSON file
- Wait for import to complete
⚠️ Important: Backups are stored locally on your device. If you clear browser data without a backup, your documents are permanently lost.
💡 Core Concepts
Intelligent Corpus Reduction
Traditional search returns individual results ranked by relevance. Invantia takes a different approach: corpus reduction.
Instead of showing you a list of search results, Invantia progressively filters your document set down to just the content relevant to your query, then packages it for AI consumption.
Traditional RAG vs. Invantia Super-Chunking
Traditional RAG:
- Documents chunked into small fragments (512 tokens)
- Embedding model finds "similar" chunks
- Top-K chunks sent to LLM
- ❌ Loses context and co-references
- ❌ May miss relevant content in lower-ranked chunks
Invantia Super-Chunking:
- Documents chunked into large sections (2000+ tokens)
- Hybrid scoring (TF-IDF + semantic expansion)
- All relevant chunks packaged together
- ✓ Preserves context and relationships
- ✓ User reviews what's included before sending to LLM
Hybrid Search Algorithm
Invantia combines multiple scoring mechanisms for accurate results:
- Exact term matching (100 points): Original query terms found in text
- Semantic expansion (30 points × similarity): Related terms from co-occurrence-based query expansion
- Proximity bonus (50 points): Terms appearing close together
- Document frequency penalty: Common terms weighted less
This hybrid approach balances precision (finding exact matches) with recall (finding semantically similar content).
Collections & Document Types
Invantia provides two orthogonal organizational systems:
- Collections: Many-to-many groupings of related documents. Example: a "Q3 2024 Board Meeting" collection contains reports, emails, and presentations.
- Document Types: Functional classification of what a document IS. Example: the same document can be in multiple collections but has exactly one type, such as "Report".
🔧 Technical Details
Client-Side Architecture
Technology Stack:
- Storage: IndexedDB API for persistent browser storage
- Document Parsing: PDF.js (PDF), Mammoth.js (DOCX), native APIs (TXT)
- Vectorization: TF-IDF implementation in JavaScript
- Search Engine: Custom hybrid scoring algorithm
- UI Framework: Vanilla JavaScript with Jinja2 templating
IndexedDB Schema
Database: InvantiaDB
Object Stores:
- documents: { id, filename, fileType, uploadDate, rawText, documentType }
- chunks: { id, documentId, chunkIndex, text, position }
- collections: { id, name, createdDate, documentIds[] }
- vectors: { documentId, matrix: Map<term, Map<term, count>>, termFrequencies, totalTerms, created }
- metadata: { key, value } // App-level settings
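A minimal sketch of how this schema might be created when the database is first opened. Only the database and store names come from the schema above; the key paths and the index are assumptions for illustration:

// Open InvantiaDB and create the object stores on first use
const request = indexedDB.open("InvantiaDB", 1);
request.onupgradeneeded = (event) => {
  const db = event.target.result;
  db.createObjectStore("documents", { keyPath: "id", autoIncrement: true });
  const chunks = db.createObjectStore("chunks", { keyPath: "id", autoIncrement: true });
  chunks.createIndex("byDocument", "documentId"); // look up all chunks for a document
  db.createObjectStore("collections", { keyPath: "id", autoIncrement: true });
  db.createObjectStore("vectors", { keyPath: "documentId" });
  db.createObjectStore("metadata", { keyPath: "key" });
};
request.onsuccess = (event) => {
  const db = event.target.result; // ready for transactions
};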
Search Algorithm Details
Phase 1: Term Extraction & Expansion
- Parse user query into terms
- For each term, find semantic expansions using the co-occurrence vectors
- Weight original terms at 100 points, expansions at 30 points
Phase 2: Chunk Scoring
- For each chunk, calculate term frequency scores
- Apply inverse document frequency penalty
- Add proximity bonus for terms appearing within 200 characters
- Normalize scores by chunk length
Phase 3: Result Packaging
- Rank chunks by total score
- Group by document
- Format as chat package within account tier limits
Performance Considerations
- Vectorization: Computed once per document at upload time
- Search: O(n) where n = total chunks, typically <100ms for 1000 chunks
- Storage: ~2MB per 100-page PDF document (includes text + vectors)
- Memory: Processes documents in streaming fashion to handle large files
Browser Compatibility
- Chrome / Edge: Version 90+ (Fully Supported)
- Firefox: Version 88+ (Fully Supported)
- Safari: Version 14+ (Fully Supported)
- Opera: Version 76+ (Fully Supported)
🔍 Intelligent Corpus Reduction: Our Search Methodology
The Core Problem
Large Language Models face a fundamental constraint: context window limits. Even with massive 100k+ token windows, processing entire document collections becomes impractical. More critically, flooding an LLM with irrelevant content degrades response quality through what researchers call "lost in the middle" effects—the model struggles to identify and use truly relevant passages when buried in noise.
The solution? Intelligent corpus reduction: systematically reducing large document sets to precisely the content needed to answer specific queries. This isn't new thinking—it's a return to principles developed decades ago, adapted for the LLM era.
Classical Foundation: Vector Space Models (1970s-1980s)
Invantia's approach builds on the Vector Space Model (VSM), pioneered by Gerard Salton at Cornell in the 1970s for the SMART information retrieval system. The core insight was elegant: represent documents and queries as vectors in a high-dimensional space where each dimension corresponds to a term. Similarity becomes a geometric problem—documents "close" to the query vector are likely relevant.
Salton introduced term weighting schemes, most famously TF-IDF (Term Frequency-Inverse Document Frequency), which elevated important terms while downweighting common words. A term appearing frequently in one document but rarely across the collection must be significant for that document's topic. This simple heuristic proved remarkably effective and remains foundational to modern search.
Statistical Evolution: Co-occurrence and Context (1990s)
The next evolution recognized that terms don't exist in isolation—they appear in contexts. If "configure" and "GPS" frequently appear near each other across documents, they're semantically related. This insight led to co-occurrence analysis and techniques like Latent Semantic Analysis (LSA, 1990), which used singular value decomposition to discover latent semantic structures.
Invantia implements a simple but effective co-occurrence matrix: for each term, track which other terms appear within a fixed window (±7 tokens). When a user searches for "configure GPS," the system expands the query with contextually related terms like "setup," "initialization," "navigation," and "positioning"—terms that frequently co-occur in the document corpus.
This query expansion dramatically improves recall without requiring neural networks or external embeddings.
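A minimal sketch of the matrix construction, assuming a simple regex tokenizer and no stop-word filtering (the real implementation keeps only meaningful terms):

// Build a term -> Map(neighbor -> count) co-occurrence matrix (±7 token window)
function buildCooccurrenceMatrix(text, windowSize = 7) {
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const matrix = new Map();
  for (let i = 0; i < tokens.length; i++) {
    const term = tokens[i];
    if (!matrix.has(term)) matrix.set(term, new Map());
    const neighbors = matrix.get(term);
    const start = Math.max(0, i - windowSize);
    const end = Math.min(tokens.length - 1, i + windowSize);
    for (let j = start; j <= end; j++) {
      if (j === i) continue;
      const other = tokens[j];
      neighbors.set(other, (neighbors.get(other) || 0) + 1);
    }
  }
  return matrix;
}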
Why Not Modern Embeddings?
One might ask: why use co-occurrence matrices when we have sophisticated transformer-based embeddings? The answer reveals our key design philosophy: privacy, transparency, and computational efficiency.
Modern embedding models require:
- Sending documents to external APIs (privacy concern)
- Large model downloads (computational overhead)
- Black-box transformations (lack of auditability)
Invantia's co-occurrence approach runs entirely client-side in the browser, requires no external services, and produces explainable results. When "configure" expands to "setup," users can verify this relationship in their own documents. For legal and accounting firms—our target market—this transparency and privacy are non-negotiable.
Hybrid Scoring Algorithm
The core innovation isn't the individual techniques—it's their orchestration for corpus reduction. Invantia employs a hybrid scoring system that balances multiple signals:
Scoring Components:
- Original Query Terms (100 points each): Exact matches to user-entered terms receive maximum weight. If someone asks about "GPS configuration," chunks containing both terms rank highest.
- Semantically Expanded Terms (30 points × similarity): Co-occurrence-based expansions contribute proportionally to their similarity score. A term with 0.8 similarity contributes 24 points.
- Proximity Bonus (up to 50 points): Terms appearing close together (within 200 characters) receive additional weight. This rewards passages where concepts are discussed together, not just mentioned separately.
This creates a ranking cascade: chunks with all original terms and tight clustering rank first (precision), while chunks with related terms still surface (recall). The minimum threshold (30 points) filters noise while preserving relevant content.
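Putting the three components together, a chunk score might be computed roughly like this. The weights, the 200-character window, and the 30-point threshold follow the description above; the exact span calculation is an assumption, and terms are assumed to be lowercased:

// Sketch of the hybrid scoring for one chunk
function scoreChunk(chunkText, originalTerms, expandedTerms) {
  const text = chunkText.toLowerCase();
  let score = 0;
  // 100 points per original query term present in the chunk
  for (const term of originalTerms) {
    if (text.includes(term)) score += 100;
  }
  // 30 points × similarity for each expanded term present
  for (const { term, similarity } of expandedTerms) {
    if (text.includes(term)) score += 30 * similarity;
  }
  // Proximity bonus: all original terms found within a 200-character span
  const positions = originalTerms.map((t) => text.indexOf(t)).filter((p) => p >= 0);
  if (positions.length === originalTerms.length && positions.length > 1) {
    const spread = Math.max(...positions) - Math.min(...positions);
    if (spread <= 200) score += 50;
  }
  return score; // the caller filters out chunks scoring below 30
}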
From Chunks to Super Chunks
After scoring and ranking, Invantia performs intelligent packaging: grouping the top-scored chunks into "super chunks" that fit within the target LLM's context window (30k for free accounts, 100k for paid). This respects the reality that users don't paste individual 2000-character chunks—they need coherent, sized payloads ready for their AI provider.
Critically, super chunks maintain topic boundaries. If a user asks multiple questions, results for each topic are grouped separately, creating a structured package that helps the LLM understand the organizational logic.
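A simplified sketch of the packing step: greedily fill each super chunk until the character budget for the selected tier would be exceeded (topic grouping and package formatting are omitted here):

// Pack ranked chunks into super chunks under a character budget
function packSuperChunks(rankedChunks, maxChars) {
  const superChunks = [];
  let current = [];
  let currentLength = 0;
  for (const chunk of rankedChunks) {
    if (currentLength + chunk.text.length > maxChars && current.length > 0) {
      superChunks.push(current); // close out the full super chunk
      current = [];
      currentLength = 0;
    }
    current.push(chunk);
    currentLength += chunk.text.length;
  }
  if (current.length > 0) superChunks.push(current);
  return superChunks;
}

// Example: packSuperChunks(rankedChunks, 30000) for the Standard tier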
Deterministic vs. Black-Box Retrieval
Modern RAG (Retrieval-Augmented Generation) systems often use neural retrievers—embedding models that map queries and documents to dense vectors, then retrieve by cosine similarity. This works but has drawbacks:
Neural RAG Limitations:
- Non-deterministic: Same query may return different results
- Unauditable: Why did this chunk rank #3? Hard to explain
- Resource-intensive: Requires GPU inference or API calls
- Privacy-leaking: Documents leave the user's control
Invantia's Classical Approach:
- Deterministic: Same query, same documents → same results
- Auditable: Scoring is transparent—original terms, expanded terms, proximity
- Lightweight: Pure JavaScript, runs in-browser
- Privacy-preserving: Documents never leave the device
Search Pipeline Overview
Phase 1: Query Understanding
- Extract terms from user's natural language question
- Build co-occurrence matrix from document corpus (±7 token window)
- Expand query terms with semantically related terms from matrix
- Weight original terms at 100 points, expansions at 30 points × similarity
Phase 2: Relevance Scoring
- Scan all chunks in selected documents/collections
- Calculate score for each chunk based on term matches
- Apply proximity bonus for co-located terms (within 200 chars)
- Filter chunks below minimum threshold (30 points)
- Rank remaining chunks by total score (descending)
Phase 3: Intelligent Packaging
- Group top-ranked chunks by topic
- Pack into super chunks respecting LLM context window limits
- Maintain topic boundaries across super chunks
- Format with clear delimiters for LLM consumption
- Present as ready-to-paste chat packages
Standing on the Shoulders of Giants
What Invantia demonstrates is that fundamental principles from the golden age of information retrieval (1970s-1990s) remain profoundly relevant. Salton's vector space model, TF-IDF weighting, co-occurrence analysis, query expansion—these weren't superseded by deep learning; they were validated.
The innovation is recognizing that for corpus reduction—the specific task of taking large document sets and reducing them to LLM-sized relevant subsets—you don't need the latest neural architecture. You need:
- Query understanding (semantic expansion via co-occurrence)
- Relevance ranking (hybrid scoring with multiple signals)
- Intelligent packaging (super chunks respecting LLM limits)
These are solved problems. The "new" part is applying them to the LLM workflow, creating a bridge between classical IR and modern AI chat interfaces.
Old Methods, New Context
Invantia's approach isn't revolutionary—it's evolutionary. It takes proven techniques from information retrieval's rich history and applies them to a new problem: preparing document corpora for LLM consumption.
The vector space model is 50 years old. Co-occurrence analysis is 35 years old. But for the task of intelligent corpus reduction—finding the needle in the haystack and presenting it to an AI in a digestible format—these classical methods remain remarkably effective.
As the saying goes: "There's nothing new under the sun." Invantia proves that in the age of transformer models and billion-parameter networks, sometimes the oldest ideas are still the best ones.
References & Further Reading
- Salton, G., Wong, A., & Yang, C. S. (1975). "A vector space model for automatic indexing." Communications of the ACM, 18(11), 613-620.
- Salton, G., & Buckley, C. (1988). "Term-weighting approaches in automatic text retrieval." Information Processing & Management, 24(5), 513-523.
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6), 391-407.
- Church, K. W., & Hanks, P. (1990). "Word association norms, mutual information, and lexicography." Computational Linguistics, 16(1), 22-29.
Understanding Invantia's Co-occurrence Vectors: A Visual Example
Sample Document Text
Let's say you upload a GPS installation manual with this text:
To configure GPS, first access the navigation settings menu.
The GPS configuration requires entering the coordinates manually.
For automatic positioning, enable the GPS receiver in setup mode.
The navigation system uses satellite signals for positioning accuracy.
Configure the antenna before testing GPS functionality.
What Gets Stored: The Co-occurrence Matrix
Invantia builds a co-occurrence matrix by tracking which words appear near each other (within ±7 tokens).
For the term "GPS":
vectors.matrix.get("gps") = Map {
"configure" => 3, // appears near "gps" 3 times
"navigation" => 2, // appears near "gps" 2 times
"positioning" => 1, // appears near "gps" 1 time
"settings" => 1,
"receiver" => 1,
"antenna" => 1,
"satellite" => 0, // never appears within 7 tokens of "gps"
"functionality"=> 1
}
For the term "configure":
vectors.matrix.get("configure") = Map {
"gps" => 3,
"navigation" => 1,
"settings" => 1,
"antenna" => 1,
"positioning" => 0,
"receiver" => 0
}
For the term "navigation":
vectors.matrix.get("navigation") = Map {
"gps" => 2,
"settings" => 1,
"system" => 1,
"positioning" => 1,
"satellite" => 1,
"configure" => 1
}
Term Frequency Data
Invantia also tracks how often each term appears:
vectors.termFrequencies = Map {
"gps" => 5, // appears 5 times total
"configure" => 3,
"navigation" => 2,
"positioning" => 2,
"settings" => 1,
"receiver" => 1,
"antenna" => 1,
"satellite" => 1,
"functionality"=> 1,
// ... etc
}
vectors.totalTerms = 87 // total meaningful terms in document
How Query Expansion Works
When you search for: "How do I configure GPS"
Step 1: Extract Query Terms
queryTerms = ["configure", "gps"]
Step 2: Find Related Terms Using Vectors
For "configure":
// Look up configure in the matrix
matrix.get("configure") = {
"gps": 3,
"navigation": 1,
"settings": 1,
"antenna": 1
}
// Calculate similarity scores (co-occurrence count / total appearances)
expandedTerms = [
{ term: "gps", similarity: 3/3 = 1.0 },
{ term: "navigation", similarity: 1/3 = 0.33 },
{ term: "settings", similarity: 1/3 = 0.33 },
{ term: "antenna", similarity: 1/3 = 0.33 }
]
For "gps":
// Look up gps in the matrix
matrix.get("gps") = {
"configure": 3,
"navigation": 2,
"positioning": 1,
"settings": 1,
"receiver": 1,
"antenna": 1,
"functionality": 1
}
// Calculate similarity scores
expandedTerms = [
{ term: "configure", similarity: 3/5 = 0.60 },
{ term: "navigation", similarity: 2/5 = 0.40 },
{ term: "positioning", similarity: 1/5 = 0.20 },
{ term: "settings", similarity: 1/5 = 0.20 },
{ term: "receiver", similarity: 1/5 = 0.20 },
{ term: "antenna", similarity: 1/5 = 0.20 },
{ term: "functionality", similarity: 1/5 = 0.20 }
]
Step 3: Combine & Deduplicate
finalExpandedQuery = {
originalTerms: ["configure", "gps"],
expandedTerms: [
// From "configure"
{ term: "navigation", similarity: 0.33, source: "configure" },
{ term: "settings", similarity: 0.33, source: "configure" },
{ term: "antenna", similarity: 0.33, source: "configure" },
// From "gps"
{ term: "positioning", similarity: 0.20, source: "gps" },
{ term: "receiver", similarity: 0.20, source: "gps" },
{ term: "functionality", similarity: 0.20, source: "gps" }
// Note: "gps" and "configure" appear in each other's expansions
// but are already in originalTerms, so not duplicated
]
}
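The same expansion logic, expressed as a small function. Similarity is the co-occurrence count divided by the query term's total frequency, as in the steps above; maxExpansions is an illustrative cap, not a documented setting:

// Expand one query term using the co-occurrence matrix and term frequencies
function expandQueryTerm(term, matrix, termFrequencies, maxExpansions = 10) {
  const neighbors = matrix.get(term);
  const total = termFrequencies.get(term);
  if (!neighbors || !total) return [];
  return [...neighbors.entries()]
    .map(([related, count]) => ({ term: related, similarity: count / total }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, maxExpansions);
}

// expandQueryTerm("configure", matrix, termFrequencies)
// → [{ term: "gps", similarity: 1.0 }, { term: "navigation", similarity: 0.33 }, ...]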
Scoring Example: How Chunks Get Ranked
Let's score two chunks from the document:
Chunk 1:
"To configure GPS, first access the navigation settings menu.
The GPS configuration requires entering the coordinates manually."
Score Calculation:
- Original term "configure": 100 points
- Original term "gps": 100 points (appears twice, counted once)
- Expanded term "navigation" (0.33 similarity): 30 × 0.33 = 10 points
- Expanded term "settings" (0.33 similarity): 30 × 0.33 = 10 points
- Proximity bonus: "configure" and "GPS" within 200 chars: 50 points
Total Score: 270 points ⭐
Chunk 2:
"The navigation system uses satellite signals for positioning accuracy."
Score Calculation:
- Original term "configure": 0 points (not present)
- Original term "gps": 0 points (not present)
- Expanded term "navigation" (0.33 similarity): 30 × 0.33 = 10 points
- Expanded term "positioning" (0.20 similarity): 30 × 0.20 = 6 points
- Proximity bonus: 0 points (original terms not present)
Total Score: 16 points (below 30 point threshold, filtered out)
Actual Data Structure in IndexedDB
Here's what's literally stored in your browser's IndexedDB:
{
documentId: 5,
// The co-occurrence matrix (converted to plain object for storage)
matrix: {
"gps": {
"configure": 3,
"navigation": 2,
"positioning": 1,
"settings": 1,
"receiver": 1,
"antenna": 1,
"functionality": 1
},
"configure": {
"gps": 3,
"navigation": 1,
"settings": 1,
"antenna": 1
},
"navigation": {
"gps": 2,
"settings": 1,
"system": 1,
"positioning": 1,
"satellite": 1,
"configure": 1
},
// ... hundreds more terms
},
// Term frequencies
termFrequencies: {
"gps": 5,
"configure": 3,
"navigation": 2,
"positioning": 2,
"settings": 1,
"receiver": 1,
// ... etc
},
totalTerms: 87,
created: "2025-12-06T23:47:26.229Z"
}
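Because the in-memory structure uses nested Maps, it has to be flattened to plain objects before being written to IndexedDB. A sketch of that conversion (the helper name is illustrative):

// Flatten Map<term, Map<neighbor, count>> into a plain nested object
function matrixToObject(matrix) {
  const obj = {};
  for (const [term, neighbors] of matrix) {
    obj[term] = Object.fromEntries(neighbors);
  }
  return obj;
}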
Why This Works
Key Insight: Words that appear near each other frequently are semantically related.
- "GPS" and "configure" appear together 3 times → strongly related
- "GPS" and "navigation" appear together 2 times → moderately related
- "GPS" and "satellite" appear together 0 times → not related in this document
When you search for "configure GPS", the system automatically knows to also look for:
- navigation (0.33 similarity)
- settings (0.33 similarity)
- positioning (0.20 similarity)
- receiver (0.20 similarity)
This catches relevant content even if it doesn't use your exact words!
Size & Performance
For a typical 100-page document (~50,000 words):
- Unique terms: ~2,000-3,000
- Matrix entries: ~10,000-20,000 term pairs
- Storage size: ~500KB-1MB
- Build time: 2-5 seconds during upload
- Search time: <100ms to scan all chunks
Comparison: What You DON'T See
What Invantia DOESN'T store:
// NO dense neural embeddings like:
"gps" => [0.234, -0.891, 0.445, 0.123, ... 768 more numbers]
// NO external API calls
// NO transformer models
// NO GPU processing
What RAG systems typically store:
// Dense 768-dimensional vectors (much larger!)
"gps" => Float32Array[768] {
0.23445, -0.89123, 0.44567, 0.12389,
-0.55234, 0.78901, -0.34567, 0.91234,
// ... 760 more floating point numbers
}
Size comparison:
- Invantia co-occurrence: ~1MB per 100-page doc
- Neural embeddings: ~15MB per 100-page doc (15x larger!)
Summary
Invantia's vectors are sparse, interpretable, and lightweight:
- ✅ Just counts of which words appear near each other
- ✅ Human-readable (you can inspect the matrix)
- ✅ Deterministic (same document → same vectors)
- ✅ Privacy-preserving (computed locally)
- ✅ Fast to build and search
- ✅ Small storage footprint
Instead of asking "what does a neural network think GPS is similar to?", we ask "what words actually appear near GPS in YOUR documents?". Simple, transparent, effective!
🔧 Troubleshooting
Common Issues
Documents won't upload
Symptoms: File picker closes but nothing happens, or upload progress stuck
Solutions:
- Check browser console (F12) for JavaScript errors
- Ensure file is under 50MB (very large files may timeout)
- Try a different browser (Chrome recommended)
- Verify file format is PDF, DOCX, or TXT
- Check available disk space (IndexedDB requires free space)
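To check whether storage quota is the limiting factor, you can paste this into the browser console (F12); the numbers are browser-dependent estimates:

// Report how much origin storage is in use vs. roughly allowed
navigator.storage.estimate().then(({ usage, quota }) => {
  console.log(`Using ${(usage / 1024 / 1024).toFixed(1)} MB of ~${(quota / 1024 / 1024).toFixed(0)} MB allowed`);
});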
Search returns no results
Symptoms: Search completes but shows "No results found"
Solutions:
- Verify documents are selected in Step 1
- Try broader search terms (e.g., "revenue" instead of "Q3 2024 revenue projections")
- Check if documents actually contain the terms you're searching for
- Try searching individual documents to narrow down the issue
Chat package too large for AI
Symptoms: AI service rejects paste, or paste truncated
Solutions:
- Switch from "Large" to "Standard" account tier setting
- Reduce number of query topics
- Search fewer documents at once
- Use more specific queries to reduce result size
- Upgrade to paid AI subscription for larger paste limits
Lost all my documents
Symptoms: Document library shows 0 documents
Solutions:
- Check if you're in the same browser and profile as before
- Look for backup files you may have created
- Check browser settings to ensure IndexedDB wasn't cleared
- If using private/incognito mode, data is cleared when window closes
⚠️ Prevention: Regularly export backups via "Backup Data" button
Slow performance with many documents
Symptoms: Search takes >5 seconds, UI laggy
Solutions:
- Close other browser tabs to free up memory
- Use collections to search subsets rather than all documents
- Clear browser cache (not IndexedDB data)
- Consider splitting into multiple collections for better organization
Browser-Specific Issues
Safari
- IndexedDB storage may be limited to 50MB per origin
- Private browsing mode has stricter limits
- Solution: Use Chrome/Firefox for large document sets
Firefox
- May prompt for storage permission on first upload
- Containers isolate IndexedDB per container
- Solution: Use same container consistently
Mobile Browsers
- Limited memory may cause large document processing to fail
- File picker behavior varies by OS
- Solution: Use desktop browser for best experience
Getting Help
If you've tried the above solutions and still have issues:
- Check GitHub issues: Report a bug
- Include browser version, OS, and specific error messages
- Attach browser console output if possible (F12 → Console tab)