Docling/docling/backend at 7d3302cb48dd91cd29673d7c4eaf7326736d0685

NeoAnd/Docling

Christoph Auer 7d3302cb48 feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make page.parsed_page the only source of truth for text cells

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Correctly compute PDF boxes from pymupdf

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use different OCR engine order

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add type hints and fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* One more test fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove with pypdfium2_lock from caller sites

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix typing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

2025-06-13 19:01:55 +02:00

ci: add coverage and ruff (#1383)

2025-04-14 18:01:26 +02:00

feat: add Docling JSON ingestion (#783)

2025-01-24 18:05:23 +01:00

chore: typo fix (#1465)

2025-04-28 08:52:09 +02:00

__init__.py

Initial commit

2024-07-15 09:42:42 +02:00

abstract_backend.py

feat: add Docling JSON ingestion (#783)

2025-01-24 18:05:23 +01:00

asciidoc_backend.py

fix: AsciiDoc header identification (#1562) (#1563)

2025-05-13 11:17:26 +02:00

csv_backend.py

ci: add coverage and ruff (#1383)

2025-04-14 18:01:26 +02:00

docling_parse_backend.py

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

docling_parse_v2_backend.py

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

docling_parse_v4_backend.py

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

html_backend.py

fix(HTML): handle row spans in header rows (#1536)

2025-05-09 15:14:32 +02:00

md_backend.py

chore: typo fix (#1465)

2025-04-28 08:52:09 +02:00

msexcel_backend.py

ci: add coverage and ruff (#1383)

2025-04-14 18:01:26 +02:00

mspowerpoint_backend.py

fix: Handle NoneType error in MsPowerpointDocumentBackend (#1747)

2025-06-10 19:43:20 +02:00

msword_backend.py

fix: Improve extraction from textboxes in Word docs (#1701)

2025-06-06 11:37:46 +02:00

pdf_backend.py

ci: add coverage and ruff (#1383)

2025-04-14 18:01:26 +02:00

pypdfium2_backend.py

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00