author: @kawam tags: #nlp #ocr


Related: OCR dan NLP

Suppose you are assigned to build a system that scans invoices, extracts key information, and stores it in a database. Here’s a high-level overview of the approach, along with the libraries and tools I would use:

  1. Text extraction with OCR: Use the Tesseract OCR engine to extract text from invoice images. In Python, the pytesseract library wraps the Tesseract executable, which must be installed separately.

  2. Preprocessing and cleaning: Clean the extracted text by removing unwanted characters, correcting common OCR errors, and normalizing the text. Use Python’s built-in string methods and the re module for this step.

  3. Tokenization and parsing: Analyze the structure of the text using the spaCy library, which provides tools for tokenization, part-of-speech tagging, and dependency parsing.

  4. Rule-based matching and regex patterns: Use regular expressions or rule-based matching to identify and extract specific fields such as dates, currency amounts, and invoice numbers. You can use the re module for regex patterns and spaCy’s Matcher for rule-based matching.

  5. Named entity recognition (NER) or text classification: Use an NLP model to recognize and classify specific fields in the invoice text. You can start from pre-trained models from Hugging Face Transformers or train custom models using spaCy; generic pre-trained NER models only know generic labels such as people, organizations, and locations, so invoice-specific fields need a custom or fine-tuned model.

  6. Storing data in the database: Once the information is extracted and classified, insert the data into the appropriate fields in your database using a database library like SQLAlchemy (for SQL databases) or PyMongo (for MongoDB).

Here’s a high-level workflow for a Python programmer:

  1. Install necessary libraries:
pip install pytesseract pillow opencv-python spacy transformers torch sqlalchemy
python -m spacy download en_core_web_sm
  2. Perform OCR with Tesseract and pytesseract:
import pytesseract
from PIL import Image

# Requires the Tesseract executable to be installed and on PATH
image = Image.open("invoice.jpg")
text = pytesseract.image_to_string(image)
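  3. Preprocess and clean the text:
import re

# Replace unexpected characters with a space; keep "/", ":", "#", and "-" so dates and invoice numbers survive
cleaned_text = re.sub(r"[^a-zA-Z0-9\s.,$%/:#-]+", " ", text)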
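The overview also mentions correcting common OCR errors and normalizing the text, which the single substitution above does not cover. A minimal sketch of both, assuming that letter/digit confusions (such as `O` for `0`) should only be repaired in tokens that contain a digit and no letters other than known look-alikes. The helper names and the confusion table are illustrative, not part of any library, and the rule will misfire on short codes such as `B2B`:

```python
import re

# Hypothetical confusion table: letters that OCR often produces in place of digits.
LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
TABLE = str.maketrans(LOOKALIKES)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs, trim each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def fix_digit_confusions(text: str) -> str:
    """Repair look-alike letters in tokens that are otherwise numeric."""

    def repair(match: re.Match) -> str:
        token = match.group()
        letters = [ch for ch in token if ch.isalpha()]
        has_digit = any(ch.isdigit() for ch in token)
        # Only touch tokens whose letters are all known look-alikes, e.g. "1O0.5O".
        if has_digit and letters and all(ch in LOOKALIKES for ch in letters):
            return token.translate(TABLE)
        return token

    # A token here is a run of letters, digits, dots, and commas.
    return re.sub(r"[A-Za-z0-9.,]+", repair, text)
```

Applied in sequence, for example `fix_digit_confusions(normalize_whitespace(text))`, this leaves identifiers such as `INV12345` untouched because `N` and `V` are not in the table.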
  4. Tokenize and parse with spaCy:
import spacy
 
nlp = spacy.load("en_core_web_sm") 
doc = nlp(cleaned_text)
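  5. Extract fields with regex patterns and rule-based matching:
# Assumes invoice numbers look like "INV" followed by five digits, e.g. INV12345
invoice_number_pattern = r"\bINV\d{5}\b"
match = re.search(invoice_number_pattern, cleaned_text)
# Keep the matched string, or None when no invoice number was found
invoice_number = match.group() if match else None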
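The snippet above extracts only the invoice number, while the overview also lists dates and currency amounts. A hedged sketch for those two fields, assuming dates appear as `YYYY-MM-DD` or `DD/MM/YYYY`, amounts are prefixed with `$`, and the text still contains the `-` and `/` separators after cleaning. The pattern and function names are illustrative, and real invoices will need more formats:

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Assumed formats: ISO dates or day-first slash dates.
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b")
# Assumed format: "$" followed by digits with optional thousands commas and cents.
AMOUNT_PATTERN = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def extract_dates(text: str) -> list[date]:
    """Return every date that matches an assumed format and is a valid calendar date."""
    found = []
    for raw in DATE_PATTERN.findall(text):
        fmt = "%Y-%m-%d" if "-" in raw else "%d/%m/%Y"
        try:
            found.append(datetime.strptime(raw, fmt).date())
        except ValueError:
            # Looks like a date but is not one (e.g. month 13); skip it.
            continue
    return found


def extract_amounts(text: str) -> list[Decimal]:
    """Return dollar amounts as Decimal, with thousands separators removed."""
    return [Decimal(raw.replace(",", "")) for raw in AMOUNT_PATTERN.findall(text)]
```

`Decimal` is used instead of `float` so that amounts such as `1250.00` are stored exactly.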
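  6. Use an NLP model for field extraction (a pre-trained NER model here; invoice-specific fields would need a model trained on labeled invoices):
from transformers import pipeline

# This checkpoint was fine-tuned on CoNLL-2003, so it labels only PER, ORG, LOC, and MISC
# aggregation_strategy="simple" merges word pieces into whole entities
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")
results = ner_pipeline(cleaned_text)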
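The pipeline returns a list of dictionaries rather than ready-to-store fields, so some post-processing is usually needed. A minimal sketch, assuming the pipeline was created with `aggregation_strategy="simple"` so that each item carries `entity_group`, `score`, and `word` keys. The sample list is hand-written for illustration and is not real model output, and `group_entities` and the `0.8` threshold are hypothetical choices:

```python
from collections import defaultdict

# Hand-written sample in the shape of aggregated NER output (illustrative values only).
sample_results = [
    {"entity_group": "ORG", "score": 0.98, "word": "Acme Corp", "start": 0, "end": 9},
    {"entity_group": "LOC", "score": 0.91, "word": "Jakarta", "start": 20, "end": 27},
    {"entity_group": "ORG", "score": 0.42, "word": "Total", "start": 40, "end": 45},
]


def group_entities(results, min_score=0.8):
    """Group entity text by label, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= min_score:
            grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


entities = group_entities(sample_results)
```

With a CoNLL-2003 model, the `ORG` group is a plausible candidate for the vendor name, but fields such as totals or due dates are better handled by the regex step or a custom model.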
  7. Store the extracted information in the database:
from sqlalchemy import create_engine, Column, Integer, String
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker
 
# Define the database model
Base = declarative_base()
 
 
class Invoice(Base):
    __tablename__ = "invoices"
    id = Column(Integer, primary_key=True)
    invoice_number = Column(String)
    # Add more fields as needed
 
 
# Connect to the database
engine = create_engine("sqlite:///invoices.db")
Base.metadata.create_all(engine)
 
Session = sessionmaker(bind=engine)
session = Session()
 
# Insert the extracted data (invoice_number comes from the regex step above)
new_invoice = Invoice(invoice_number=invoice_number)
session.add(new_invoice)
session.commit()
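The model above stores only the invoice number. Once more fields are added, it can help to collect the extracted values in one validated record before building the database row. A minimal sketch using only the standard library; `InvoiceRecord`, `issued_on`, `total`, and `as_row` are hypothetical names, and the invoice number format is the one assumed by the regex earlier in this note:

```python
import re
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class InvoiceRecord:
    invoice_number: str
    issued_on: Optional[date] = None
    total: Optional[Decimal] = None

    def __post_init__(self):
        # Reject values that do not match the assumed "INV" + five digits format.
        if not re.fullmatch(r"INV\d{5}", self.invoice_number):
            raise ValueError(f"unexpected invoice number: {self.invoice_number!r}")
        if self.total is not None and self.total < 0:
            raise ValueError("total must not be negative")

    def as_row(self) -> dict:
        """Return plain values suitable as keyword arguments for an ORM model."""
        return {
            "invoice_number": self.invoice_number,
            "issued_on": self.issued_on.isoformat() if self.issued_on else None,
            "total": str(self.total) if self.total is not None else None,
        }


record = InvoiceRecord("INV00042", date(2024, 3, 15), Decimal("1250.00"))
```

The `Invoice` model would need matching `issued_on` and `total` columns before something like `Invoice(**record.as_row())` could work.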