AI-Powered Language Digitization Platform

Digitizing the Uzbek Language
Through Advanced AI

An end-to-end AI pipeline that transforms scanned historical Uzbek dictionaries into structured, searchable digital resources — leveraging multimodal LLMs for OCR, NLP, and linguistic analysis at scale.

Live Statistics

Real-Time Project Dashboard

All metrics are computed live from our production PostgreSQL database, reflecting the current state of the digitization pipeline.

Dictionary Volumes

Scanned Pages

OCR Processed

Extracted Headwords

Human-Verified Pages

Overall OCR Completion: 100.0%

Charts: Headword Distribution by Letter, Page Status Breakdown, Processing Activity (Last 7 Days)

Architecture

End-to-End Processing Pipeline

From physical book pages to structured linguistic data: upload, page splitting, and OCR are automated and AI-powered, and expert linguists verify the output.

PDF Upload

Scanned dictionary volumes

Page Splitting

Automated segmentation

AI-OCR

Gemini multimodal vision

Human Review

Expert linguist verification

Structured Data

Headwords, definitions, phonetics

Technology

Built on Google Cloud

Our AI-first architecture uses Google's AI and cloud services for OCR, storage, and large-scale processing.

🧠

Google Gemini 3.1 Pro

Multimodal large language model for advanced OCR of historical Uzbek script with context-aware text recognition and Markdown formatting.

Vertex AI, Vision API, Multimodal

☁️

Google Cloud Platform

Enterprise-grade infrastructure with Compute Engine VMs, Cloud Storage, and managed PostgreSQL for reliable and scalable processing.

Compute Engine, Cloud Storage, PostgreSQL

⚡

Async Task Processing

A distributed task queue built on Celery and Redis runs OCR on thousands of pages in parallel, with automatic retries and error handling.

Celery, Redis, Docker

📖

NLP & Linguistics

Automated headword extraction, etymology tagging, part-of-speech classification, and phonetic transcription for complete linguistic analysis.

Regex, NLP, G2P, IPA
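
As a rough illustration of regex-based headword extraction, the sketch below parses OCR output in Markdown. The entry layout it assumes (a bold headword at the start of a line, an optional bracketed origin tag, then the definition) is a guess made for this example, not the dictionaries' documented format; `ENTRY_RE`, `Entry`, `parse_entries`, and the sample text are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed layout: **HEADWORD** [origin tag] definition text
ENTRY_RE = re.compile(
    r"^\*\*(?P<headword>[^*]+)\*\*"        # bold headword at line start
    r"(?:\s*\[(?P<etymology>[^\]]+)\])?"   # optional bracketed origin tag
    r"\s+(?P<definition>.+)$"              # remainder of the line
)


@dataclass
class Entry:
    headword: str
    etymology: Optional[str]
    definition: str


def parse_entries(markdown: str) -> List[Entry]:
    """Return one Entry per line that matches the assumed entry layout."""
    entries = []
    for line in markdown.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            entries.append(
                Entry(
                    headword=match["headword"].strip(),
                    etymology=match["etymology"],  # None when no tag is present
                    definition=match["definition"].strip(),
                )
            )
    return entries


# Made-up sample in the assumed layout.
SAMPLE = """\
**KITOB** [ar.] Book; a bound printed work.
**DAFTAR** Notebook.
A continuation line without a headword.
"""

entries = parse_entries(SAMPLE)
for entry in entries:
    print(entry)
```

Lines that do not start with a bold headword are skipped here; a fuller parser would probably attach them to the preceding entry as continuation text.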

Per-Volume Tracking

Digitization Progress

Real-time completion tracking for each dictionary volume in the pipeline.

Volume	Pages	OCR Pages	OCR %	Parsed Pages	Parse %
B	508	508	100%	508	100%
T	547	547	100%	547	100%
J	95	95	100%	95	100%
X	110	110	100%	110	100%
G	106	106	100%	106	100%
Ch	160	160	100%	160	100%
A	191	191	100%	191	100%
E	87	87	100%	87	100%
D	230	230	100%	230	100%
F	83	83	100%	83	100%
H	180	180	100%	180	100%
S	345	345	100%	344	100%
G'	63	63	100%	63	100%
L	61	61	100%	61	100%
K	331	331	100%	330	100%
I	149	149	100%	149	100%
M	256	256	100%	255	100%
N	100	100	100%	100	100%
O	197	197	100%	196	99%
P	225	225	100%	224	100%
Q	361	361	100%	361	100%
Sh	142	142	100%	133	94%
V	72	72	100%	66	92%
Z	76	76	100%	71	93%
Y	208	208	100%	208	100%
U	95	95	100%	95	100%
R	113	113	100%	113	100%
O'	99	99	100%	98	99%
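
The percentage columns appear to be each count divided by the volume's page total, rounded to a whole percent. The sketch below reproduces that arithmetic for a few rows copied from the table; the `pct` helper and the rounding rule are assumptions inferred from the displayed values, not the dashboard's actual code.

```python
# (volume, pages, ocr_pages, parsed_pages): a few rows copied from the table
ROWS = [
    ("B", 508, 508, 508),
    ("S", 345, 345, 344),
    ("Sh", 142, 142, 133),
    ("V", 72, 72, 66),
]


def pct(done, total, digits=0):
    """Percentage rounded to `digits` decimals; 0.0 when total is zero."""
    return round(100 * done / total, digits) if total else 0.0


for volume, pages, ocr_pages, parsed_pages in ROWS:
    whole = pct(parsed_pages, pages)          # as displayed in the table
    precise = pct(parsed_pages, pages, 1)     # one decimal for comparison
    print(f"{volume}\t{whole:.0f}%\t{precise:.1f}%")
```

Rounding to whole percents means a volume such as S (344 of 345 pages parsed) displays 100% even though one page remains unparsed; one decimal place would make that gap visible.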

Presentation

Project Pitch Deck

An in-depth overview of our AI-powered linguistic pipeline — architecture, impact, and vision.
