Build an OCR Service in Node.js & Express Using Tesseract.js 🧠
Optical Character Recognition (OCR) is a powerful tool for extracting text from images and PDFs. In this guide, we'll build a modular OCR microservice using Node.js, Express, and Tesseract.js, with support for both image and PDF uploads. We'll follow best practices for structure, error handling, and file management.
Table of Contents
- Introduction
- Project Setup
- Project Structure
- Installing Dependencies
- Setting Up the Express Server
- Designing the Modular Architecture
- Implementing the OCR Service
- Creating Controllers
- Defining Routes
- Testing the Service
1. Introduction
Optical Character Recognition (OCR) is a powerful technology that enables the extraction of text from images, scanned documents, or handwritten notes. By converting visual information into machine-readable text, OCR opens up a wide range of possibilities, from automating data entry to making printed documents searchable and accessible.
OCR plays a vital role in bridging the gap between the physical and digital worlds. In this article, we'll explore how to build a simple yet effective OCR service using Node.js, Express, and the Tesseract OCR engine.
Tesseract.js brings the power of Tesseract OCR to Node.js, making it easy to integrate OCR into your web services.
We'll build a REST API that accepts image or PDF uploads and returns the extracted text, using a clean, maintainable architecture.
You can find the complete source code for this OCR service on GitHub. Feel free to explore, clone, or modify the project to suit your needs; it's completely open source! If you find it helpful or end up using it in your own projects, a ⭐️ on the repo would be greatly appreciated. Your support helps keep the project alive and encourages further development!
2. Project Setup
mkdir ocr-service
cd ocr-service
npm init -y
The code in this guide uses ES module import syntax and top-level await, so add "type": "module" to the generated package.json.
3. Project Structure
ocr-service/
├── src/
│   ├── controllers/
│   │   └── ocrController.js
│   ├── routes/
│   │   └── ocrRoutes.js
│   ├── services/
│   │   └── ocrService.js
│   ├── temp/                # Temporary files for processing
│   ├── server.js
│   └── eng.traineddata      # Optional: Language data
├── .gitignore
├── package.json
└── README.md
4. Installing Dependencies
We'll need the following:
- express: Web framework
- multer: For handling file uploads
- tesseract.js: OCR engine
- sharp: Image processing
- dotenv: Environment config
- cors: CORS support
- nodemon: Auto-restart for dev
- child_process: To run pdftoppm for PDF conversion (built into Node.js, so there is nothing to install)
Install with:
npm install express multer tesseract.js sharp dotenv cors
npm install --save-dev nodemon
Note: PDF support requires pdftoppm, which is part of Poppler.
macOS: brew install poppler
Ubuntu: sudo apt-get install poppler-utils
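Because a missing pdftoppm binary only shows up when the first PDF arrives, it can help to check for it at startup. The following is a minimal sketch, not part of the original project: the helper name hasCommand is hypothetical, and it assumes that a spawn error with code ENOENT means the binary is not on the PATH.

```javascript
// Hypothetical helper (not part of the original project): resolves to true
// when `command` can be spawned, and false when it is not found on the PATH.
async function hasCommand(command, args = ['-v']) {
  // Dynamic import keeps the snippet usable from ES modules and CommonJS alike.
  const { execFile } = await import('node:child_process');
  return new Promise((resolve) => {
    execFile(command, args, (error) => {
      // ENOENT means the executable could not be found. Any other outcome
      // (including a non-zero exit code) still proves the binary exists.
      resolve(!(error && error.code === 'ENOENT'));
    });
  });
}

// Example startup check: warn early instead of failing on the first PDF upload.
hasCommand('pdftoppm').then((found) => {
  if (!found) {
    console.warn('pdftoppm not found: PDF uploads will fail until Poppler is installed.');
  }
});
```

The check only confirms that the binary can be launched; it says nothing about the installed version.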
5. Setting Up the Express Server
src/server.js
import express from 'express';
import cors from 'cors';
import dotenv from 'dotenv';
import ocrRoutes from './routes/ocrRoutes.js';

dotenv.config();

const app = express();
const port = process.env.PORT || 3000;

app.use(cors());
app.use(express.json());
app.use(express.urlencoded({ extended: true }));
app.use('/', ocrRoutes);

// Centralized error handler
app.use((err, req, res, next) => {
  console.error('Error caught by central handler:', err.stack);
  res.status(500).json({ error: err.message || 'Something went wrong!' });
});

app.listen(port, () => {
  console.log(`🧠 OCR server listening at http://localhost:${port}`);
});
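The handler above answers every failure with a 500. One possible refinement, sketched below, lets errors carry their own status code so that client mistakes (such as an unsupported file type) can be told apart from server faults. The HttpError class and its statusCode property are assumptions made for illustration, not part of the original code.

```javascript
// Hypothetical error type that carries an HTTP status code alongside the message.
class HttpError extends Error {
  constructor(statusCode, message) {
    super(message);
    this.name = 'HttpError';
    this.statusCode = statusCode;
  }
}

// Express recognizes error-handling middleware by its four-argument signature.
function errorHandler(err, req, res, next) {
  // Use the status attached to the error when it is a valid 4xx/5xx code.
  const status =
    Number.isInteger(err.statusCode) && err.statusCode >= 400 && err.statusCode <= 599
      ? err.statusCode
      : 500;
  console.error('Error caught by central handler:', err.stack);
  res.status(status).json({ error: err.message || 'Something went wrong!' });
}
```

It would be registered with app.use(errorHandler) after the routes, and a controller could then call next(new HttpError(415, 'Unsupported file type')).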
6. Designing the Modular Architecture
We break our logic into:
- Routes: Handle endpoints & uploads
- Controllers: Handle requests/responses
- Services: Core logic for OCR & file processing
This promotes clean separation of concerns, testability, and scalability.
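One practical payoff of this split is that a controller can be exercised without running Tesseract at all. The sketch below is illustrative rather than the project's actual code: it assumes a hypothetical factory, makeOCRHandler, that receives the service as an argument, so a stub can stand in for the real OCR service in a test.

```javascript
// Hypothetical factory: the controller gets its service injected instead of importing it.
function makeOCRHandler(ocrService) {
  return async function handleOCRRequest(req, res, next) {
    if (!req.file) {
      return res.status(400).json({ error: 'No file uploaded' });
    }
    try {
      const text = await ocrService.performOCR(req.file.buffer);
      res.json({ text });
    } catch (error) {
      next(error);
    }
  };
}

// Minimal stand-ins for the Express response object and the OCR service.
function createFakeResponse() {
  return {
    statusCode: 200,
    body: undefined,
    status(code) { this.statusCode = code; return this; },
    json(payload) { this.body = payload; return this; },
  };
}
const stubService = { performOCR: async () => 'stubbed text' };

// Exercise the controller with no HTTP server and no OCR engine involved.
async function demo() {
  const handler = makeOCRHandler(stubService);
  const res = createFakeResponse();
  await handler({ file: { buffer: Buffer.from('fake image bytes') } }, res, () => {});
  return res.body;
}

demo().then((body) => console.log(body)); // prints { text: 'stubbed text' }
```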
7. Implementing the OCR Service
src/services/ocrService.js
import { createWorker } from 'tesseract.js';
import sharp from 'sharp';
import fs from 'fs/promises';
import path from 'path';
import { fileURLToPath } from 'url';
import { execFile } from 'child_process';
import { promisify } from 'util';
import { randomUUID } from 'crypto';

const execFileAsync = promisify(execFile);
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
const tempDir = path.join(__dirname, '../temp');
await fs.mkdir(tempDir, { recursive: true });

// OCR a single image buffer with a short-lived worker.
const performOCR = async (imageBuffer) => {
  const worker = await createWorker('eng');
  try {
    const { data: { text } } = await worker.recognize(imageBuffer);
    return text;
  } finally {
    await worker.terminate();
  }
};

// Rasterize each PDF page with pdftoppm, then OCR the pages in order.
const performPDFOCR = async (pdfBuffer) => {
  let worker = null;
  const tempFiles = [];
  // A random ID keeps concurrent requests from sharing temp file names.
  const jobId = randomUUID();
  try {
    const pdfPath = path.join(tempDir, `temp_${jobId}.pdf`);
    await fs.writeFile(pdfPath, pdfBuffer);
    tempFiles.push(pdfPath);

    const outputPrefix = path.join(tempDir, `page_${jobId}`);
    // execFile passes the arguments directly, without going through a shell.
    await execFileAsync('pdftoppm', ['-png', '-r', '300', pdfPath, outputPrefix]);

    const files = await fs.readdir(tempDir);
    const pageFiles = files
      .filter(file => file.startsWith(path.basename(outputPrefix)))
      .sort();
    // Register every page for cleanup up front, in case OCR fails midway.
    tempFiles.push(...pageFiles.map(file => path.join(tempDir, file)));

    worker = await createWorker('eng');
    let extractedText = '';
    for (const pageFile of pageFiles) {
      const imageBuffer = await fs.readFile(path.join(tempDir, pageFile));
      const processedImage = await sharp(imageBuffer).sharpen().toBuffer();
      const { data: { text } } = await worker.recognize(processedImage);
      extractedText += text + '\n\n';
    }
    return extractedText.trim();
  } finally {
    if (worker) await worker.terminate();
    for (const file of tempFiles) {
      try { await fs.unlink(file); } catch {}
    }
  }
};

export default { performOCR, performPDFOCR };
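The page-collection step in performPDFOCR relies on pdftoppm's output naming, where each page is written as the prefix followed by a hyphen and the page number. A slightly more defensive version of that step is sketched below. The helper name selectPageFiles is hypothetical, and the sketch assumes PNG output as produced by the -png flag. It matches the full file name pattern and sorts by the numeric page index rather than by string order.

```javascript
// Hypothetical helper: pick the page images that pdftoppm wrote for `prefix`
// and return them in page order.
function selectPageFiles(fileNames, prefix) {
  // Escape regex metacharacters so the prefix is matched literally.
  const escaped = prefix.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`^${escaped}-(\\d+)\\.png$`);
  return fileNames
    .map((name) => ({ name, match: pattern.exec(name) }))
    .filter((entry) => entry.match !== null)
    // Compare page numbers numerically, so page 10 sorts after page 2.
    .sort((a, b) => Number(a.match[1]) - Number(b.match[1]))
    .map((entry) => entry.name);
}

const listing = ['page_1-10.png', 'page_1-2.png', 'page_12-1.png', 'temp_1.pdf', 'page_1-1.png'];
console.log(selectPageFiles(listing, 'page_1'));
// prints [ 'page_1-1.png', 'page_1-2.png', 'page_1-10.png' ]
```

Matching on the full pattern also keeps files from another job with a similar prefix (page_12 in the example) out of the result.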
8. Creating Controllers
src/controllers/ocrController.js
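import ocrService from '../services/ocrService.js';

const handleHealthCheck = (req, res) => {
  res.json({ message: 'OCR server is running!' });
};

const handleOCRRequest = async (req, res, next) => {
  if (!req.file) {
    return res.status(400).json({ error: 'No file uploaded' });
  }
  // Route PDFs through pdftoppm; everything else goes straight to Tesseract.
  const isPDF = req.file.mimetype === 'application/pdf';
  try {
    const text = isPDF
      ? await ocrService.performPDFOCR(req.file.buffer)
      : await ocrService.performOCR(req.file.buffer);
    res.json({ text });
  } catch (error) {
    // Log the underlying cause; the client only sees a generic message.
    console.error(error);
    next(new Error(isPDF ? 'PDF OCR processing failed' : 'Image OCR processing failed'));
  }
};

export { handleHealthCheck, handleOCRRequest };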
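Since the service accepts both images and PDFs, the controller has to decide which path an upload takes. The MIME type reported by the client is one signal, but it is supplied by the uploader and can be wrong. As a hedged sketch, the upload's leading bytes can be inspected instead. The function name detectFileKind is hypothetical, and only the PDF, PNG, and JPEG signatures are covered here.

```javascript
// Hypothetical helper: classify an upload by its leading "magic" bytes
// rather than by the client-supplied MIME type.
function detectFileKind(buffer) {
  const startsWith = (bytes) =>
    buffer.length >= bytes.length && bytes.every((byte, i) => buffer[i] === byte);

  if (startsWith([0x25, 0x50, 0x44, 0x46, 0x2d])) return 'pdf'; // "%PDF-"
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'png';
  if (startsWith([0xff, 0xd8, 0xff])) return 'jpeg';
  return 'unknown';
}

console.log(detectFileKind(Buffer.from('%PDF-1.7\n'))); // prints pdf
console.log(detectFileKind(Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]))); // prints png
console.log(detectFileKind(Buffer.from('hello'))); // prints unknown
```

In the controller, a result of 'pdf' would select performPDFOCR, 'png' or 'jpeg' would select performOCR, and 'unknown' could be rejected with a 400 response.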
9. Defining Routes
src/routes/ocrRoutes.js
import express from 'express';
import multer from 'multer';
import { handleHealthCheck, handleOCRRequest } from '../controllers/ocrController.js';
const router = express.Router();
const upload = multer({ storage: multer.memoryStorage() });
router.get('/', handleHealthCheck);
router.post('/ocr', upload.single('image'), handleOCRRequest);
export default router;
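As written, the route accepts any file of any size into memory. Multer's limits and fileFilter options can restrict that. The sketch below shows one way to express such a policy. The 10 MB cap and the list of accepted MIME types are assumptions chosen for illustration, not values from the original project.

```javascript
// Assumed policy for illustration: PDFs and common image formats, up to 10 MB.
const ALLOWED_MIME_TYPES = new Set(['application/pdf', 'image/png', 'image/jpeg']);
const uploadLimits = { fileSize: 10 * 1024 * 1024 };

// Matches the (req, file, cb) signature that multer expects for fileFilter.
function fileFilter(req, file, cb) {
  if (ALLOWED_MIME_TYPES.has(file.mimetype)) {
    cb(null, true); // accept the file
  } else {
    cb(new Error(`Unsupported file type: ${file.mimetype}`)); // reject with an error
  }
}

// In src/routes/ocrRoutes.js these would be passed to multer like so:
// const upload = multer({ storage: multer.memoryStorage(), limits: uploadLimits, fileFilter });
```

Errors raised by the filter or by the size limit reach the centralized error handler like any other error passed to next().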
10. Testing the Service
Start the server with node src/server.js, then test the OCR service using curl or Postman:
Image or PDF Upload
curl -X POST http://localhost:3000/ocr \
-F "image=@/path/to/your/file.png"
curl -X POST http://localhost:3000/ocr \
-F "image=@/path/to/your/file.pdf"
Sample Response:
{
"text": "Extracted text from your PDF or image..."
}
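The same request can also be sent from a script. The sketch below assumes Node.js 18 or later, where fetch, FormData, and Blob are available as globals, and a server listening on localhost:3000 as configured earlier. The function names buildOcrForm and requestOcr are hypothetical.

```javascript
// Build the multipart body; the field name must match upload.single('image').
function buildOcrForm(fileBytes, fileName, mimeType) {
  const form = new FormData();
  form.append('image', new Blob([fileBytes], { type: mimeType }), fileName);
  return form;
}

// POST the file to the OCR endpoint and return the extracted text.
async function requestOcr(fileBytes, fileName, mimeType, baseUrl = 'http://localhost:3000') {
  const response = await fetch(`${baseUrl}/ocr`, {
    method: 'POST',
    body: buildOcrForm(fileBytes, fileName, mimeType),
  });
  const payload = await response.json();
  if (!response.ok) {
    throw new Error(payload.error || `Request failed with status ${response.status}`);
  }
  return payload.text;
}
```

For example, after reading a file into a buffer with fs.readFile, calling requestOcr(bytes, 'scan.png', 'image/png') would resolve to the text field of the sample response.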
Error Handling
Errors raised during processing are passed to the centralized middleware and returned as JSON with a 500 status:
{
"error": "PDF OCR processing failed"
}
Environment Variables
You can use a .env file to configure settings like PORT, future API keys, etc.
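Values read from the environment are always strings, so it can be worth validating them once at startup. A minimal sketch follows. loadConfig is a hypothetical helper, and the default of 3000 mirrors the fallback used in src/server.js.

```javascript
// Hypothetical helper: read and validate settings from an environment object.
function loadConfig(env = process.env) {
  const rawPort = env.PORT || '3000';
  const port = Number(rawPort);
  // Reject values such as "abc", "80.5", or out-of-range port numbers.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT value: ${rawPort}`);
  }
  return { port };
}

console.log(loadConfig({ PORT: '8080' })); // prints { port: 8080 }
console.log(loadConfig({}));               // prints { port: 3000 }
```

Calling it after dotenv.config() would make a mistyped PORT fail loudly at startup instead of at listen time.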
Conclusion
You now have a working OCR microservice that:
- Accepts images and PDFs
- Extracts text using Tesseract.js
- Follows a modular and clean architecture
- Cleans up temp files automatically