"A picture is worth a thousand words. (one image token >> one text token)"
With a high compression ratio, our 256 image tokens can encode a multi-column PDF page with no loss of information. Further, our 2048 image tokens can efficiently handle an 8-page PDF document. Based on this feature, we forge an amazing "reading pen", termed Fox.
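The token arithmetic behind these figures can be sketched as below. This is a minimal illustration, not code from Fox: the function names are hypothetical, and the 1000-character page is only the lower bound this page quotes for its benchmark documents.

```python
# Hypothetical helpers illustrating the token budget quoted above.
TOKENS_PER_PAGE = 256  # image tokens per page, as stated in the text


def image_token_budget(num_pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Total image tokens needed for a document with num_pages pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * tokens_per_page


def chars_per_image_token(chars_on_page: int, tokens_per_page: int = TOKENS_PER_PAGE) -> float:
    """Average number of characters that each image token has to carry."""
    return chars_on_page / tokens_per_page


print(image_token_budget(8))        # 8 pages at 256 tokens each
print(chars_per_image_token(1000))  # assumed 1000-character page
```

Under these assumptions, an 8-page document costs 8 × 256 = 2048 image tokens, and each image token covers roughly four characters of a 1000-character page.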
Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents.
We introduce a novel task to boost document understanding by making LVLMs focus attention on document-level regions, such as redefining full-page OCR as foreground focus. We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages (e.g., a page containing a photo). Meanwhile, we render cross-vocabulary vision data as the catalyzer to achieve a full reaction of multiple visual vocabularies and in-document figure understanding. Further, without modifying the weights of the multiple vision vocabularies, the above catalyzed fine-grained understanding capabilities can be efficiently tuned to multi-page documents, enabling the model to focus anywhere in both format-free and page-free manners.
Besides, we build a benchmark including 9 fine-grained sub-tasks (e.g., region-level OCR/summary/translation, color-guided OCR, cross-page VQA, multi-page OCR) to promote document analysis in the community. The experimental results verify the superiority of our model.
Fox can give precise responses when focusing on an 8-page document. These pages contain bilingual content, have well over a thousand characters per page, and have a variety of single/multi-column layouts. This extreme case demonstrates Fox's powerful focusing capabilities.
The left case shows that Fox can handle the cross-page VQA task on a multi-page (8 pages as an example) document. The right case shows that Fox can perform dense Chinese text recognition by foreground focus and obtain precise results.
The proposed Fox easily performs dense English text recognition by foreground focus.
Fox can achieve text-associative in-page figure captioning (see "young Dual Language Learners") and fine-grained document understanding.
Fox enjoys high flexibility and robustness when performing fine-grained region-level translation/summary/OCR tasks in multi-column documents.
Fox can focus on the in-document figure and recognize this map of "global seismic hazards".
Fox can perform interesting VQA on a cartoon book.
Of course, Fox can yield interesting results in cartoon and natural scenes.
Beyond the conventional single-page sparse VQA, we build a fine-grained benchmark for dense document-level understanding. This benchmark includes 9 challenging tasks and multi-grained questions:
1) page OCR: OCR this image.
2) region-level OCR: Give the OCR results of the box [x1, y1, x2, y2].
3) line-level OCR: OCR the line [x0, y0].
4) color-guided OCR: OCR red/green/blue box.
5) region-level translation: OCR this page.
6) region-level summary: OCR this page.
7) in-document figure caption: What is this in the box [x1, y1, x2, y2]?
8) multi-page multi-region OCR: OCR boxes on multiple pages. Page 1: [box1], Page 2: [box2], Page 3: [box3], Page 4: [box4], Page 5: [box5], Page 6: [box6], Page 7: [box7], Page 8: [box8].
9) cross-page VQA: Which page's box contains more characters? Page 1: [box1], Page 2: [box2], Page 3: [box3], Page 4: [box4], Page 5: [box5], Page 6: [box6], Page 7: [box7], Page 8: [box8].
The above tasks are mainly built on 112 English PDF pages and 100 Chinese PDF pages from the Internet. Each page contains more than 1000 characters. We will continue to develop this benchmark.
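As a rough illustration, the prompt templates listed above could be assembled programmatically as follows. This is only a sketch: the helper names and the integer coordinate convention are assumptions made for illustration and are not taken from the Fox codebase. It covers tasks 1-4 and 7-9.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) convention

PAGE_OCR_PROMPT = "OCR this image."


def format_box(box: Box) -> str:
    """Render a box in the bracketed form used by the prompt templates."""
    x1, y1, x2, y2 = box
    return f"[{x1}, {y1}, {x2}, {y2}]"


def region_ocr_prompt(box: Box) -> str:
    return f"Give the OCR results of the box {format_box(box)}."


def line_ocr_prompt(point: Tuple[int, int]) -> str:
    x0, y0 = point
    return f"OCR the line [{x0}, {y0}]."


def color_ocr_prompt(color: str) -> str:
    if color not in {"red", "green", "blue"}:
        raise ValueError(f"unsupported box color: {color}")
    return f"OCR {color} box."


def figure_caption_prompt(box: Box) -> str:
    return f"What is this in the box {format_box(box)}?"


def _page_list(boxes: Sequence[Box]) -> str:
    """Join one box per page as 'Page 1: [...], Page 2: [...]'."""
    return ", ".join(
        f"Page {i}: {format_box(b)}" for i, b in enumerate(boxes, start=1)
    )


def multi_page_ocr_prompt(boxes: Sequence[Box]) -> str:
    return f"OCR boxes on multiple pages. {_page_list(boxes)}."


def cross_page_vqa_prompt(boxes: Sequence[Box]) -> str:
    return f"Which page's box contains more characters? {_page_list(boxes)}."


print(region_ocr_prompt((10, 20, 300, 400)))
print(cross_page_vqa_prompt([(1, 2, 3, 4), (5, 6, 7, 8)]))
```

Keeping the per-page box list in one helper means the multi-page OCR and cross-page VQA prompts stay in sync for any number of pages.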
Some examples of the 9 fine-grained sub-tasks (no cropping patches!) in Fox.
(All images are from the Internet. If you have any questions, please email us.)
Dense English PDF page.
Color-guided OCR on Chinese PDF page.
In-document figure caption on our rendered interleaved page.
@article{liu2024focus,
title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
journal={arXiv preprint arXiv:2405.14295},
year={2024}
}