AI-Powered Language Digitization Platform

Digitizing the Uzbek Language Through Advanced AI

An end-to-end AI pipeline that transforms scanned historical Uzbek dictionaries into structured, searchable digital resources, using multimodal LLMs for OCR, NLP, and linguistic analysis at scale.

Real-Time Project Dashboard

All metrics are computed live from our production PostgreSQL database, reflecting the current state of the digitization pipeline.

Dictionary Volumes: 0
Scanned Pages: 0
OCR Processed: 0
Extracted Headwords: 0
Human-Verified Pages: 0
Overall OCR Completion: 100.0%
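
As a rough illustration of how such counters could be derived from a single aggregate query, here is a minimal sketch. The `pages` table and its columns are hypothetical (the real schema is not shown here), and an in-memory SQLite database stands in for the production PostgreSQL instance so the example runs on its own.

```python
import sqlite3

# Hypothetical schema: one row per scanned page, with 0/1 status flags.
SCHEMA = """
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    volume TEXT NOT NULL,
    ocr_done INTEGER NOT NULL DEFAULT 0,
    verified INTEGER NOT NULL DEFAULT 0
)
"""

# One pass over the table yields all page-level counters.
METRICS_SQL = """
SELECT COUNT(DISTINCT volume),
       COUNT(*),
       COALESCE(SUM(ocr_done), 0),
       COALESCE(SUM(verified), 0)
FROM pages
"""


def dashboard_metrics(conn):
    """Return the page-level dashboard counters as a dict."""
    volumes, scanned, ocr, verified = conn.execute(METRICS_SQL).fetchone()
    # Guard against division by zero on an empty database.
    pct = round(100.0 * ocr / scanned, 1) if scanned else 0.0
    return {
        "dictionary_volumes": volumes,
        "scanned_pages": scanned,
        "ocr_processed": ocr,
        "human_verified_pages": verified,
        "overall_ocr_completion": pct,
    }


def demo_connection():
    """Build a tiny in-memory database with three made-up pages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO pages (volume, ocr_done, verified) VALUES (?, ?, ?)",
        [("L", 1, 1), ("L", 1, 0), ("E", 0, 0)],
    )
    return conn


if __name__ == "__main__":
    print(dashboard_metrics(demo_connection()))
```

The same query shape should carry over to PostgreSQL with boolean columns cast to integers, though that has not been checked against the actual database.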

Charts: Headword Distribution by Letter, Page Status Breakdown, Processing Activity (Last 7 Days)

End-to-End Processing Pipeline

From physical book pages to structured linguistic data: automated, AI-powered processing, with expert human review before the final structured output.

1. PDF Upload: Scanned dictionary volumes
2. Page Splitting: Automated segmentation
3. AI-OCR: Gemini multimodal vision
4. Human Review: Expert linguist verification
5. Structured Data: Headwords, definitions, phonetics
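
The five stages above form a strict sequence, which could be modeled as an ordered enumeration. This is only a sketch: the `Stage` names mirror the stage titles, but the enum and the `next_stage` helper are hypothetical and not taken from the platform's code.

```python
from enum import Enum


class Stage(Enum):
    """Pipeline stages in processing order (names are illustrative)."""
    PDF_UPLOAD = 1
    PAGE_SPLITTING = 2
    AI_OCR = 3
    HUMAN_REVIEW = 4
    STRUCTURED_DATA = 5


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is Stage.STRUCTURED_DATA:
        return None
    return Stage(stage.value + 1)


def remaining_stages(stage):
    """List every stage a page still has to pass through after `stage`."""
    remaining = []
    current = next_stage(stage)
    while current is not None:
        remaining.append(current)
        current = next_stage(current)
    return remaining


if __name__ == "__main__":
    print([s.name for s in remaining_stages(Stage.AI_OCR)])
```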

Built on Google Cloud

Our AI-first architecture runs on Google Cloud and uses Google's Gemini models for OCR of historical Uzbek text.

Google Gemini 3.1 Pro

Multimodal large language model for advanced OCR of historical Uzbek script with context-aware text recognition and Markdown formatting.

Vertex AI, Vision API, Multimodal

Google Cloud Platform

Enterprise-grade infrastructure with Compute Engine VMs, Cloud Storage, and managed PostgreSQL for reliable and scalable processing.

Compute Engine, Cloud Storage, PostgreSQL

Async Task Processing

A Celery + Redis distributed task queue enables parallel OCR processing of thousands of pages, with automatic retry and error handling.

Celery, Redis, Docker
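
To make the "parallel processing with automatic retry" idea concrete, here is a minimal standard-library sketch. It deliberately swaps Celery and Redis for a `ThreadPoolExecutor` so it runs without a broker; the function names, the retry count, and the backoff schedule are all assumptions, not the platform's actual configuration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, arg, max_retries=3, base_delay=1.0):
    """Call task(arg); on failure, retry with exponential backoff.

    Re-raises the last exception once max_retries retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return task(arg)
        except Exception:
            if attempt == max_retries:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            time.sleep(base_delay * (2 ** attempt))


def process_pages(pages, ocr_page, workers=4, **retry_kwargs):
    """Run ocr_page over all pages in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_with_retry, ocr_page, page, **retry_kwargs)
            for page in pages
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Stand-in for a real OCR call; it just labels the page number.
    print(process_pages([1, 2, 3], lambda p: f"text of page {p}"))
```

In a Celery deployment the retry loop would typically be replaced by the task's own retry settings, and the thread pool by worker processes consuming from the Redis-backed queue.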

NLP & Linguistics

Automated headword extraction, etymology tagging, part-of-speech classification, and phonetic transcription for complete linguistic analysis.

Regex, NLP, G2P, IPA
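
As one possible shape for regex-based headword extraction, the sketch below assumes that the OCR Markdown marks each headword in bold at the start of a line; that convention is an assumption, as are the function names. It also groups headwords by their first letter, treating the Uzbek Latin digraphs Ch and Sh and the letters O' and G' as single letters, which matches how the dictionary volumes are labeled.

```python
import re
from collections import Counter

# Assumed convention: an entry starts with its headword in Markdown bold.
HEADWORD_RE = re.compile(r"^\*\*(.+?)\*\*", re.MULTILINE)

# Apostrophe-like characters seen after O and G in Uzbek Latin text.
APOSTROPHES = "'`\u02bb\u2018\u2019"


def extract_headwords(markdown):
    """Return all bold line-initial strings, in document order."""
    return [m.strip() for m in HEADWORD_RE.findall(markdown)]


def first_letter(headword):
    """Return the alphabet letter a headword is filed under."""
    word = headword.strip().upper()
    if not word:
        return ""
    if word[:2] in ("CH", "SH"):
        return word[0] + "h"
    if word[0] in "OG" and len(word) > 1 and word[1] in APOSTROPHES:
        return word[0] + "'"
    return word[0]


def letter_distribution(headwords):
    """Count headwords per letter."""
    return Counter(first_letter(h) for h in headwords)


if __name__ == "__main__":
    sample = "**kitob** ot.\n**choy** ot.\n**o\u02bbzbek** ot.\n**olma** ot.\n"
    print(letter_distribution(extract_headwords(sample)))
```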

Digitization Progress

Real-time completion tracking for each dictionary volume in the pipeline.

Volume  Pages  OCR Progress  OCR %  Parsed  Parse %
L       61     61            100%   61      100%
E       87     87            100%   87      100%
G       106    106           100%   106     100%
A       191    191           100%   191     100%
B       508    508           100%   508     100%
Z       76     76            100%   71      93%
Ch      160    160           100%   160     100%
X       110    110           100%   0       0%
Y       208    208           100%   0       0%
V       72     72            100%   66      92%
U       95     95            100%   95      100%
T       547    547           100%   0       0%
D       230    230           100%   230     100%
F       83     83            100%   83      100%
G'      63     63            100%   63      100%
H       180    180           100%   180     100%
I       149    149           100%   149     100%
K       331    331           100%   330     100%
J       95     95            100%   95      100%
M       256    256           100%   255     100%
N       100    100           100%   100     100%
O       197    197           100%   196     99%
O'      99     99            100%   98      99%
Q       361    361           100%   361     100%
R       113    113           100%   113     100%
Sh      142    142           100%   133     94%
P       225    225           100%   224     100%
S       345    345           100%   0       0%
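
The percentages in this table appear to be rounded to the nearest whole number, which is why a volume such as K (330 of 331 pages parsed) still shows 100%. A small sketch of that calculation follows; the function names are hypothetical, and the floor-based variant is only a suggestion for displays that should reserve 100% for fully completed volumes.

```python
def completion_percent(done, total):
    """Percentage of pages done, rounded to the nearest whole number."""
    if total == 0:
        return 0
    return round(100 * done / total)


def completion_percent_floor(done, total):
    """Like completion_percent, but rounds down, so 100 means all done."""
    if total == 0:
        return 0
    return (100 * done) // total


if __name__ == "__main__":
    # (volume, pages, parsed) rows taken from the table above.
    rows = [("K", 331, 330), ("Z", 76, 71), ("V", 72, 66), ("X", 110, 0)]
    for volume, pages, parsed in rows:
        print(volume, completion_percent(parsed, pages),
              completion_percent_floor(parsed, pages))
```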

Project Pitch Deck

An in-depth overview of our AI-powered linguistic pipeline: architecture, impact, and vision.

Uzbek_Linguistic_AI_Pipeline.pdf