pytesseract

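Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and "read" the text embedded in images.

Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.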

USAGE

Quickstart

Note: Test images are located in the tests/data folder of the Git repo.

Library usage:

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

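# If tesseract is not on your PATH, set the executable location explicitly:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))

# List of available languages
print(pytesseract.get_languages(config=''))

# French text image to string
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

# A relative or absolute image path can be used instead of an Image object
# NOTE: In this case you should provide tesseract supported images or tesseract will return an error
print(pytesseract.image_to_string('test.png'))

# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))

# Timeout/terminate the tesseract job after a period of time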
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')
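
The batch-processing call above passes a plain text file instead of an image. The following is a minimal sketch of how such a list file might be produced, assuming tesseract expects one image path per line. The helper names write_image_list and ocr_batch and the file names are hypothetical and not part of pytesseract.

```python
from pathlib import Path


def write_image_list(image_paths, list_file="images.txt"):
    """Write one image path per line and return the list file's path.

    Hypothetical helper, not part of pytesseract.
    """
    lines = [str(Path(p)) for p in image_paths]
    list_path = Path(list_file)
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return list_path


def ocr_batch(image_paths, list_file="images.txt"):
    """Run OCR over all listed images in a single tesseract invocation."""
    import pytesseract  # imported lazily so write_image_list works without it

    list_path = write_image_list(image_paths, list_file)
    return pytesseract.image_to_string(str(list_path))
```

Every path in the list should point to an image format that tesseract itself can read, since no Pillow conversion happens when a file path is passed.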

Support for OpenCV image/NumPy array objects

import cv2

img_cv = cv2.imread(r'/<path_to_image>/digits.png')

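# By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,
# we need to convert from BGR to RGB format/mode:
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img_rgb))
# OR
# Note: PIL expects the size as (width, height), while a NumPy shape is (height, width, channels)
img_rgb = Image.frombytes('RGB', img_cv.shape[1::-1], img_cv, 'raw', 'BGR', 0, 0)
print(pytesseract.image_to_string(img_rgb))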

If you need custom configuration like oem/psm, use the config keyword.

# Example of adding any additional options
custom_oem_psm_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config)

# Example of using a pre-defined tesseract config file with options
cfg_filename = 'words'
pytesseract.run_and_get_output(image, extension='txt', config=cfg_filename)

If you get a tessdata error such as "Error opening data file...", add the following config:

# Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
# It's important to add double quotes around the dir path.
tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)
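
The options shown in this section can be combined in a single config string. The following is a small sketch of assembling one, including the double quotes around the tessdata path. build_config is a hypothetical helper, not part of pytesseract, and it only covers the three flags mentioned in this document.

```python
def build_config(oem=None, psm=None, tessdata_dir=None):
    """Assemble a string for the `config` keyword.

    Hypothetical helper, not part of pytesseract.
    """
    parts = []
    if tessdata_dir is not None:
        # Double quotes keep a path that contains spaces together as one argument.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    if oem is not None:
        parts.append(f"--oem {int(oem)}")
    if psm is not None:
        parts.append(f"--psm {int(psm)}")
    return " ".join(parts)


# Possible usage (the path is a placeholder):
# config = build_config(oem=3, psm=6, tessdata_dir="<replace_with_your_tessdata_dir_path>")
# pytesseract.image_to_string(image, lang='chi_sim', config=config)
```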

Functions

get_languages - Returns all languages currently supported by Tesseract OCR.
get_tesseract_version - Returns the Tesseract version installed on the system.
image_to_string - Returns the unmodified output of Tesseract OCR processing as a string.
image_to_boxes - Returns the recognized characters and their box boundaries.
image_to_data - Returns box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, see the Tesseract TSV documentation: https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#tsv-output-currently-available-in-305-dev-in-master-branch-on-github
image_to_osd - Returns information about orientation and script detection.
image_to_alto_xml - Returns the result in Tesseract's ALTO XML format.
run_and_get_output - Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.
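
As an illustration of working with the image_to_boxes result, here is a minimal parsing sketch. It assumes the default string output follows Tesseract's box-file layout, one symbol per line as "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. CharBox and parse_boxes are hypothetical names, not part of pytesseract.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(boxes_text):
    """Parse the string returned by image_to_boxes into CharBox tuples."""
    boxes = []
    for line in boxes_text.splitlines():
        if not line.strip():
            continue
        # The last five fields are integers; everything before them is the symbol.
        char, left, bottom, right, top, page = line.rsplit(" ", 5)
        boxes.append(CharBox(char, int(left), int(bottom), int(right), int(top), int(page)))
    return boxes


# Possible usage:
# boxes = parse_boxes(pytesseract.image_to_boxes(Image.open('test.png')))
```

Because the origin is at the bottom-left, drawing these boxes with a library that uses a top-left origin needs the y values flipped against the image height.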

Parameters

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)

image Object or String - PIL Image/NumPy array or file path of the image to be processed by Tesseract. If you pass object instead of file path, pytesseract will implicitly convert the image to RGB mode.
lang String - Tesseract language code string. Defaults to eng if not specified! Example for multiple languages: lang='eng+fra'
config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
nice Integer - modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.
output_type Class attribute - specifies the type of the output, defaults to string. For the full list of all supported types, please check the definition of pytesseract.Output class.
timeout Integer or Float - duration in seconds for the OCR processing, after which, pytesseract will terminate and raise RuntimeError.
pandas_config Dict - only for the Output.DATAFRAME type. Dictionary with custom arguments for pandas.read_csv. Allows you to customize the output of image_to_data.
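
With the default output_type, image_to_data returns Tesseract's TSV text. The sketch below filters that text by confidence using only the standard library. It assumes the TSV header contains the columns "conf" and "text" and that non-word rows carry a confidence of -1. confident_words is a hypothetical helper and the threshold of 60 is arbitrary.

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, conf) pairs for words at or above min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row["conf"])
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed rows
        # Page, block, paragraph and line rows have no text and conf -1.
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


# Possible usage:
# words = confident_words(pytesseract.image_to_data(Image.open('test.png')))
```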

CLI usage:

pytesseract [-l lang] image_file
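
The same command can be driven from another Python program. This is a rough sketch, assuming the pytesseract script is installed and reachable on PATH. cli_command and run_cli are hypothetical helper names.

```python
import subprocess


def cli_command(image_file, lang=None):
    """Build the argument list for: pytesseract [-l lang] image_file"""
    cmd = ["pytesseract"]
    if lang:
        cmd += ["-l", lang]
    cmd.append(image_file)
    return cmd


def run_cli(image_file, lang=None):
    """Run the script and return the recognized text it prints."""
    result = subprocess.run(
        cli_command(image_file, lang), capture_output=True, text=True, check=True
    )
    return result.stdout
```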

INSTALLATION

Prerequisites:

Python-tesseract requires Python 3.6+
You will need the Python Imaging Library (PIL) (or the Pillow fork). Under Debian/Ubuntu, this is the package python-imaging or python3-imaging.
Install Google Tesseract OCR (additional info how to install the engine on Linux, Mac OSX and Windows). You must be able to invoke the tesseract command as tesseract. If this isn't the case, for example because tesseract isn't in your PATH, you will have to change the "tesseract_cmd" variable pytesseract.pytesseract.tesseract_cmd. Under Debian/Ubuntu you can use the package tesseract-ocr. Mac OS users can install the Homebrew package tesseract.

Note: Make sure that you also have installed tessconfigs and configs from tesseract-ocr/tessconfigs or via the OS package manager.
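
To check whether the tesseract command can be found before running OCR, something like the following sketch may help. It looks for an explicitly given path first and then searches PATH. resolve_tesseract_cmd and configure_pytesseract are hypothetical helpers, not part of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(explicit_path=None):
    """Return a tesseract executable path, or None if none is found."""
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # shutil.which searches the directories listed in PATH.
    return shutil.which("tesseract")


def configure_pytesseract(explicit_path=None):
    """Point pytesseract at the resolved executable; raise if it is missing."""
    import pytesseract  # imported lazily so the lookup above works without it

    cmd = resolve_tesseract_cmd(explicit_path)
    if cmd is None:
        raise FileNotFoundError(
            "tesseract executable not found; install it or pass its full path"
        )
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```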

Installing via pip:

Check the pytesseract package page for more information.

pip install pytesseract

Or if you have git installed:

pip install -U git+https://github.com/madmaze/pytesseract.git

Installing from source:

git clone https://github.com/madmaze/pytesseract.git
cd pytesseract && pip install -U .

Install with conda (via conda-forge):

conda install -c conda-forge pytesseract

Version notes:

pytesseract-0.2.0.tar.gz supports Python 2.5.
pytesseract-0.2.7.tar.gz supports Python 2.7; use this version for both Python 2 and Python 3.
pytesseract-0.3.7.tar.gz supports Python 2.7; the recognized text contains extra spaces, which is undesirable.
pytesseract-0.3.8.tar.gz supports Python 3.6.
pytesseract-0.3.9.tar.gz supports Python 3.7.