r"""°°°
# Document Loaders

- loading text from local sources
- main driver is the `unstructured` Python package

## Key Concepts

### Document

Container class for document information. Contains:
- `page_content`: the document text
- `metadata`: a dict of additional information about the document

### Loader

Base class for loading documents. Exposes:
- `load() -> List[Document]`

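
As a rough illustration of the two concepts above, here is a minimal sketch in plain Python. It is not LangChain's actual implementation: `Document`, `BaseLoader`, and `TextFileLoader` below are simplified stand-ins written for this note, and the `source` metadata key is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class Document:
    # Simplified stand-in: the text plus a free-form metadata dict.
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseLoader:
    # Simplified stand-in for a loader base class.
    def load(self) -> List[Document]:
        raise NotImplementedError


class TextFileLoader(BaseLoader):
    # Hypothetical loader that reads one UTF-8 text file.
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def load(self) -> List[Document]:
        text = Path(self.file_path).read_text(encoding="utf-8")
        # The "source" key is an assumption made for this sketch.
        return [Document(page_content=text, metadata={"source": self.file_path})]
```

A loader built this way is used as `TextFileLoader("notes.txt").load()`, which yields a one-element list whose item carries the file text and its path.
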
## Setup Unstructured
- host dependencies:
  - poppler: PDF rendering library
- Python deps:
  - Pillow: imaging library
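
Before installing, it can help to check whether the two dependencies listed above are already present. The following is a minimal sketch using only the standard library. It assumes that finding `pdftoppm` (one of poppler's command-line tools) on `PATH` is a good enough proxy for poppler being installed. The function name `check_dependencies` is made up for this note.

```python
import importlib.util
import shutil
from typing import Dict


def check_dependencies() -> Dict[str, bool]:
    # pdftoppm ships with poppler's command-line utilities; finding it on
    # PATH is used here as a proxy for "poppler is installed".
    has_poppler = shutil.which("pdftoppm") is not None
    # Pillow is imported under the module name PIL.
    has_pillow = importlib.util.find_spec("PIL") is not None
    return {"poppler": has_poppler, "pillow": has_pillow}
```

Calling `check_dependencies()` returns a dict with one boolean per dependency, so a `False` entry shows what still needs installing.
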
°°°"""
#|%%--%%| <4yTe29l2Ya|srwyN0cVES>

# %pip install pillow (already installed)
%pip install -q "unstructured[local-inference]"

#|%%--%%| <srwyN0cVES|cbFv0eSeXq>

docs_dir = "unstructured-examples"
!mkdir -p $docs_dir
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P $docs_dir
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P $docs_dir


#|%%--%%| <cbFv0eSeXq|U633RkWjYq>
r"""°°°
[repo link](https://github.com/Unstructured-IO/unstructured#coffee-getting-started)

The easiest way to parse a document in unstructured is to use the `partition` brick. It detects the file type and routes the file to the appropriate file-specific partitioning brick. Before using it, ensure you first install libmagic, following the installation instructions in the repo linked above. `partition` always applies the default arguments. If you need advanced features, use a document-specific brick. The `partition` brick currently works for .txt, .docx, .pptx, .jpg, .png, .eml, .html, and .pdf documents.

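
The routing idea can be sketched in a few lines. This is only an illustration of dispatch by file type, not how unstructured implements `partition`: the handler functions and the `HANDLERS` table are hypothetical, and the sketch keys off the file extension, whereas the libmagic requirement above suggests the real brick inspects file content.

```python
from pathlib import Path
from typing import Callable, Dict, List


def partition_text_sketch(path: str) -> List[str]:
    # Hypothetical handler: treat each non-empty line as one element.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def partition_html_sketch(path: str) -> List[str]:
    # Hypothetical placeholder: a real handler would parse the markup.
    raise NotImplementedError("HTML handling is not part of this sketch")


# Hypothetical routing table: file extension -> file-specific handler.
HANDLERS: Dict[str, Callable[[str], List[str]]] = {
    ".txt": partition_text_sketch,
    ".html": partition_html_sketch,
}


def partition_sketch(path: str) -> List[str]:
    # Pick the handler from the extension and delegate to it.
    suffix = Path(path).suffix.lower()
    try:
        handler = HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return handler(path)
```

The single entry point keeps calling code independent of the file type, at the cost of only exposing default behaviour. That matches the trade-off described above: anything more specific means calling a file-specific handler directly.
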
Partitioning the PDF requires detectron2 inference (CUDA?)
°°°"""
#|%%--%%| <U633RkWjYq|FJaYuFeL0U>

docs_dir = "unstructured-examples"
#|%%--%%| <FJaYuFeL0U|9MKaXz7Bi4>

# NOTE: partitioning this PDF needs inference with Facebook's detectron2,
# so the call is left commented out.

# from unstructured.partition.auto import partition

# elements = partition(filename=docs_dir + "/layout-parser-paper.pdf")


#|%%--%%| <9MKaXz7Bi4|X4mTVZAzcD>
r"""°°°
## Unstructured LangChain FileLoader

Requires detectron2

°°°"""
#|%%--%%| <X4mTVZAzcD|9k0eAtsfvh>

from langchain.document_loaders import UnstructuredFileLoader

# Point the loader at the PDF downloaded earlier.
loader = UnstructuredFileLoader("./unstructured-examples/layout-parser-paper.pdf")

# load() returns a list of Document objects.
docs = loader.load()
#|%%--%%| <9k0eAtsfvh|1lKP9jNDd4>