Commit Graph

  • f4a1c06937 chore: bump version to 2.40.0 [skip ci] (main) github-actions[bot] 2025-07-04 15:31:36 +00:00
  • ec6cf6f7e8 feat: Introduce LayoutOptions to control layout postprocessing behaviour (#1870) Christoph Auer 2025-07-04 15:36:13 +02:00
  • 598c9c53d4 fix: Secure torch model inits with global locks (#1884) Christoph Auer 2025-07-04 07:27:26 +02:00
  • 13865c06f5 perf(msexcel): _find_table_bounds use iter_rows/iter_cols instead of Worksheet.cell (#1875) Qiefan Jiang 2025-07-03 19:12:06 +08:00
  • 3089cf2d26 perf: Move expensive imports closer to usage (#1863) William Easton 2025-07-01 15:27:17 -05:00
  • 56a0e104f7 feat: Integrate ListItemMarkerProcessor into document assembly (#1825) Christoph Auer 2025-07-01 10:04:58 +02:00
  • bdfee4e2d0 chore: Safer unloading of DPv4 backend (#1867) Christoph Auer 2025-06-30 14:41:21 +02:00
  • ae39a9411a fix: Ensure that TesseractOcrModel does not crash in case OSD is not installed (#1866) Nikos Livathinos 2025-06-30 10:55:56 +02:00
  • bb99be6c24 chore: bump version to 2.39.0 [skip ci] github-actions[bot] 2025-06-27 15:37:53 +00:00
  • 0533da1923 feat: leverage new list modeling, capture default markers (#1856) Panos Vagenas 2025-06-27 16:37:15 +02:00
  • e79e4f0ab6 fix(markdown): make parsing of rich table cells valid (#1821) Michael Honaker 2025-06-26 13:50:45 -04:00
  • ee4781075a chore: bump version to 2.38.1 [skip ci] github-actions[bot] 2025-06-25 16:27:46 +00:00
  • d337825b8e fix: updated granite vision model version for picture description (#1852) pranaymiri 2025-06-25 21:19:56 +05:30
  • 7c5614a37a fix(markdown): fix single-formatted headings & list items (#1820) Panos Vagenas 2025-06-25 13:05:06 +02:00
  • 41e8cae26b fix: fix response type of ollama (#1850) Michele Dolfi 2025-06-25 04:33:09 -05:00
  • 4002de1f92 fix: Handle missing runs to avoid out of range exception (#1844) Allen N. 2025-06-24 22:55:27 -07:00
  • 1dc63d0aa9 chore: bump version to 2.38.0 [skip ci] github-actions[bot] 2025-06-23 18:14:24 +00:00
  • f3ae3029b8 docs: update readme and add ASR example (#1836) Peter W. J. Staar 2025-06-23 18:55:16 +02:00
  • 1557e7ce3e feat: Support audio input (#1763) Peter W. J. Staar 2025-06-23 14:47:26 +02:00
  • d26dac61a8 fix(docx): ensure list items have a list parent (#1827) Cesar Berrospi Ramis 2025-06-20 14:47:25 +02:00
  • 1350a8d3e5 fix(msword_backend): Identify text in the same line after an image #1425 (#1610) mkrssg 2025-06-20 10:55:30 +02:00
  • 64ac043786 docs: support running examples from root or subfolder (#1816) Michele Dolfi 2025-06-19 04:10:40 -05:00
  • dd7f64ff28 fix: Ensure uninitialized pages are removed before assembling document (#1812) Christoph Auer 2025-06-19 07:33:25 +02:00
  • 861abcdcb0 feat(markdown): add formatting & improve inline support (#1804) Panos Vagenas 2025-06-18 15:57:57 +02:00
  • 215b540f6c feat: Maximum image size for Vlm models (#1802) Shkarupa Alex 2025-06-18 13:57:37 +03:00
  • dbab30e92c fix: formula conversion with page_range param set (#1791) Mahafuzur Rahman 2025-06-17 17:58:45 +06:00
  • c2ef69718a chore: dco advisor (#1795) Michele Dolfi 2025-06-17 02:45:56 -05:00
  • 7bae3b6c06 chore: bump version to 2.37.0 [skip ci] github-actions[bot] 2025-06-16 11:02:54 +00:00
  • f28d23cf03 fix: pptx line break and space handling (#1664) Martin Wind 2025-06-16 10:44:30 +02:00
  • b886e4df31 fix(asciidoc): set default size when missing in image directive (#1769) Cesar Berrospi Ramis 2025-06-16 10:38:46 +02:00
  • 7d3302cb48 feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) Christoph Auer 2025-06-13 19:01:55 +02:00
  • 0432a31b2f docs: update vlm models api examples with LM Studio (#1759) Michele Dolfi 2025-06-12 05:58:44 -05:00
  • 7a275c7637 fix: Handle NoneType error in MsPowerpointDocumentBackend (#1747) Bruno Rigal 2025-06-10 19:43:20 +02:00
  • df140227c3 feat: support xlsm files (#1520) Ayraf 2025-06-10 20:25:59 +05:30
  • 6613b9e98b fix: prov for merged-elems (#1728) Peter W. J. Staar 2025-06-10 11:22:42 +02:00
  • e979750ce9 fix(tesseract): initialize df_osd to avoid uninitialized variable error (#1718) Maras Ioannis 2025-06-10 11:57:45 +03:00
  • f7f31137f1 fix: allow custom torch_dtype in vlm models (#1735) Michele Dolfi 2025-06-10 03:52:15 -05:00
  • 49b10e7419 docs: add open webui (#1734) Michele Dolfi 2025-06-10 02:35:20 -05:00
  • 9dbcb3d7d4 fix: Improve extraction from textboxes in Word docs (#1701) AndrewTsai0406 2025-06-06 17:37:46 +08:00
  • a2b83fe4ae fix: Add WEBP to the list of image file extensions (#1711) Eugene 2025-06-05 11:09:27 +04:00
  • 40df0d74ad chore: bump version to 2.36.1 [skip ci] github-actions[bot] 2025-06-04 11:43:13 +00:00
  • 8846f1a393 fix: remove typer and click constraints (#1707) Michele Dolfi 2025-06-04 13:06:23 +02:00
  • be42b03f9b docs: flash-attn usage and install (#1706) Michele Dolfi 2025-06-04 11:09:54 +02:00
  • 96c54dba91 chore: bump version to 2.36.0 [skip ci] github-actions[bot] 2025-06-03 13:54:25 +00:00
  • cdd401847a feat: simplify dependencies, switch to uv (#1700) Michele Dolfi 2025-06-03 15:18:54 +02:00
  • 61d0d6c755 test: mark flaky test (#1698) Panos Vagenas 2025-06-03 13:13:44 +02:00
  • cfdf4cea25 feat: new vlm-models support (#1570) Peter W. J. Staar 2025-06-02 17:01:06 +02:00
  • 08dcacc5cb chore: bump version to 2.35.0 [skip ci] github-actions[bot] 2025-06-02 12:30:26 +00:00
  • 11ca4f7a7b docs: fix typo in index.md (#1676) Edgar Hipp 2025-06-02 12:35:59 +02:00
  • 1c8a1283c4 test: ensure utf-8 in test data utils (#1691) Panos Vagenas 2025-06-02 12:13:19 +02:00
  • 984cb137f6 fix: guess HTML content starting with script tag (#1673) Cesar Berrospi Ramis 2025-06-02 08:43:24 +02:00
  • 3942923125 chore: fix or ignore runtime and deprecation warnings (#1660) Cesar Berrospi Ramis 2025-05-28 17:55:31 +02:00
  • b3e0042813 chore: exclude data from GH Linguist (#1671) Panos Vagenas 2025-05-28 15:42:34 +02:00
  • 106951e71e test: add missing ground truth files (#1667) Cesar Berrospi Ramis 2025-05-28 13:26:49 +02:00
  • b356b33059 feat: Add visualization of bbox on page with html export. (#1663) Peter W. J. Staar 2025-05-28 13:10:38 +02:00
  • 51d3450915 fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte (#1665) DavidLee 2025-05-27 20:06:05 +08:00
  • 2579d89510 chore: bump version to 2.34.0 [skip ci] github-actions[bot] 2025-05-22 18:44:45 +00:00
  • c2f595d283 fix: fix ZeroDivisionError for cell_bbox.area() (#1636) Said Gürbüz 2025-05-22 13:43:33 +02:00
  • 45265bf8b1 feat(ocr): auto-detect rotated pages in Tesseract (#1167) Clément Doumouro 2025-05-21 18:12:33 +02:00
  • 90875247e5 feat: Establish confidence estimation for document and pages (#1313) Christoph Auer 2025-05-21 12:32:49 +02:00
  • 14d4f5b109 fix(integration): update the Apify Actor integration (#1619) Václav Vančura 2025-05-21 02:47:55 +02:00
  • 84d0889829 chore: bump version to 2.33.0 [skip ci] github-actions[bot] 2025-05-20 19:54:51 +00:00
  • f4d9d4111b fix: Fix issue with detecting docx files, and files with upper case extensions (#1609) MoheyElDin Badr 2025-05-20 20:42:37 +03:00
  • 0e00a263fa fix: load_from_doctags static usage (#1617) Said Gürbüz 2025-05-20 15:06:12 +02:00
  • f2e9c0784c fix: incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371) Krishnan 2025-05-20 13:29:38 +05:30
  • 98b5eeb844 fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549) Pedro Ribeiro 2025-05-19 14:26:00 +01:00
  • 12a0e64892 feat: add textbox content extraction in msword_backend (#1538) AndrewTsai0406 2025-05-19 21:01:36 +08:00
  • 7c4c356e76 chore: fix chunking example data link (#1596) Panos Vagenas 2025-05-16 08:44:47 +02:00
  • aeb0716bbb chore: bump version to 2.32.0 [skip ci] github-actions[bot] 2025-05-14 14:28:21 +00:00
  • 3a04f2a367 feat: Improve parallelization for remote services API calls (#1548) Vinay R Damodaran 2025-05-14 06:47:55 -07:00
  • 9f8b479f17 fix(ocr): orig field in TesseractOcrCliModel as str (#1553) jimkarag02 2025-05-14 16:05:52 +03:00
  • 9f28abf061 docs: add advanced chunking & serialization example (#1589) Panos Vagenas 2025-05-14 13:35:07 +01:00
  • 2efb7a7c06 fix(settings): fix nested settings load via environment variables (#1551) Alex Sokolov 2025-05-14 14:42:10 +03:00
  • 12dab0a1e8 feat: support image/webp file type (#1415) Elwin 2025-05-14 15:47:28 +08:00
  • 23238c241f chore: bump version to 2.31.2 [skip ci] github-actions[bot] 2025-05-13 10:09:19 +00:00
  • 4046d0b2f3 fix: AsciiDoc header identification (#1562) (#1563) Marco Fargetta 2025-05-13 11:17:26 +02:00
  • 8baa85a49d fix: restrict click version and update lock file (#1582) Michele Dolfi 2025-05-13 10:40:08 +02:00
  • 0d0fa6cbe3 chore: bump version to 2.31.1 [skip ci] github-actions[bot] 2025-05-12 09:44:26 +00:00
  • 127e38646f fix: add smoldocling in download utils (#1577) Michele Dolfi 2025-05-12 10:48:07 +02:00
  • 844babb390 docs: update links in data_prep_kit (#1559) Oleg Lavrovsky 2025-05-11 20:38:25 +02:00
  • 776e7ecf9a fix(HTML): handle row spans in header rows (#1536) Cesar Berrospi Ramis 2025-05-09 15:14:32 +02:00
  • 3220a592e7 docs: add serialization docs, update chunking docs (#1556) Panos Vagenas 2025-05-08 21:43:01 +02:00
  • f1658edbad fix: mime error in document streams (#1523) DavidLee 2025-05-06 15:30:46 +08:00
  • 7c705739f9 fix: usage of hashlib for FIPS (#1512) Michele Dolfi 2025-05-02 15:03:29 +02:00
  • de56523974 chore: format JSON test files to enable comparison (#1511) Panos Vagenas 2025-05-02 11:52:18 +03:00
  • b147331f2a chore: restore typing hint for self.script_readers (#1500) Ihar Hrachyshka 2025-04-30 14:33:27 -04:00
  • 4ab7e9ddfb fix: Guard against attribute errors in TesseractOcrModel __del__ (#1494) Ben Browning 2025-04-30 11:51:33 -04:00
  • cc453961a9 fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496) Zach Cox 2025-04-30 02:02:52 -04:00
  • 976e92e289 fix: updated the time-recorder label for reading order (#1490) Peter W. J. Staar 2025-04-29 13:02:53 +02:00
  • d8959c6b19 chore: update dependencies in lock file (#1458) Michele Dolfi 2025-04-28 08:52:46 +02:00
  • a097ccd8d5 chore: typo fix (#1465) nkh0472 2025-04-28 14:52:09 +08:00
  • 3afbe6c969 docs: update supported formats guide (#1463) Emmanuel Ferdman 2025-04-28 09:51:54 +03:00
  • 94d66a0765 fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459) Maxim Lysak 2025-04-25 12:34:12 +02:00
  • c67133dde4 chore: bump version to 2.31.0 [skip ci] github-actions[bot] 2025-04-25 08:28:25 +00:00
  • a2fbbba9f7 feat: add tutorial using Milvus and Docling for RAG pipeline (#1449) Ryan Lin 2025-04-25 03:12:35 -04:00
  • 976431ed7f chore: update locked deps (#1442) Michele Dolfi 2025-04-23 14:59:31 +02:00
  • ed20124544 fix(html): handle address, details, and summary tags (#1436) Cesar Berrospi Ramis 2025-04-23 09:30:59 +02:00
  • c2470ed216 docs: Fix wrong output format in example code (#1427) nkh0472 2025-04-22 18:32:55 +08:00
  • 64918a81ac docs: Add OpenSSF Best Practices badge (#1430) Michele Dolfi 2025-04-22 11:23:28 +02:00
  • 995b3b0ab1 docs: Typo fixes in docling_document.md (#1400) Ben Cox 2025-04-22 07:49:08 +01:00