pull/58/merge
Jacobrakai 1 year ago committed by GitHub
commit 2df4f4e7a7

@@ -0,0 +1,22 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/python
{
"name": "Python 3",
// Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
"image": "mcr.microsoft.com/devcontainers/python:0-3.11"
// Features to add to the dev container. More info: https://containers.dev/features.
// "features": {},
// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
// Use 'postCreateCommand' to run commands after the container is created.
// "postCreateCommand": "pip3 install --user -r requirements.txt",
// Configure tool-specific properties.
// "customizations": {},
// Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
// "remoteUser": "root"
}

.gitignore

@@ -0,0 +1,2 @@
.vscode
.devcontainer

@@ -0,0 +1,21 @@
## MIT License
**Copyright (c) [2023] [multiple]**
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
**THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.**

@@ -1,42 +1,44 @@
# pdfGPT
pdfGPT allows users to chat with an uploaded PDF file using GPT functionalities. It solves many of the problems associated with using Open AI for text analysis.
## Problem Description
1. Open AI has a 4K token limit and cannot take an entire PDF file as input.
2. Open AI sometimes becomes overly chatty and returns irrelevant responses not related to the user's query, because of poor embeddings.
3. ChatGPT cannot talk directly to external data without token-hungry LangChain implementations.
4. Solutions such as https://www.chatpdf.com, https://www.bespacific.com/chat-with-any-pdf, and https://www.filechat.io suffer from poor content quality and are prone to hallucination. Improved embeddings help avoid such issues.
To solve these problems, improved embeddings are generated with the Universal Sentence Encoder family of algorithms (read more here: https://tfhub.dev/google/collections/universal-sentence-encoder/1).
## Solution
1. The application intelligently breaks the document into smaller chunks and employs a powerful Deep Averaging Network Encoder to generate embeddings (see the sketch after this list).
2. A semantic search is performed on your PDF content, and the most relevant embeddings are passed to Open AI.
3. Custom logic generates precise responses. The returned response can even cite, in square brackets ([]), the page number where the information is located, adding credibility to the responses and helping to locate pertinent information quickly.
4. The responses are much better than the naive responses generated by Open AI.
5. According to a tweet by Andrej Karpathy (https://twitter.com/karpathy/status/1647025230546886658), the KNN algorithm is well suited to problems like this.
6. Enables APIs in production using **[langchain-serve](https://github.com/jina-ai/langchain-serve)**.
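For orientation, below is a minimal sketch of the retrieval pipeline described above, condensed from the `SemanticSearch` logic in `api.py`; the helper names `build_index` and `answer_context` are illustrative, not part of the app's API.
```python
# A condensed sketch of the retrieval pipeline, assuming tensorflow_hub and
# scikit-learn are installed; helper names here are illustrative only.
import numpy as np
import tensorflow_hub as hub
from sklearn.neighbors import NearestNeighbors

# Universal Sentence Encoder (a Deep Averaging Network encoder)
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def build_index(chunks, n_neighbors=5, batch=1000):
    """Embed the text chunks in batches and fit a KNN index over them."""
    embeddings = np.vstack(
        [encoder(chunks[i : i + batch]) for i in range(0, len(chunks), batch)]
    )
    nn = NearestNeighbors(n_neighbors=min(n_neighbors, len(chunks)))
    nn.fit(embeddings)
    return nn

def answer_context(nn, chunks, question):
    """Return the chunks semantically closest to the question; these are what
    gets passed to Open AI along with the prompt instructions."""
    neighbors = nn.kneighbors(encoder([question]), return_distance=False)[0]
    return [chunks[i] for i in neighbors]
```
The full implementation in `api.py` also tags each chunk with its page number, which is what lets the model cite pages in its answers.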
### Demo
1. **Demo URL**: https://bit.ly/41ZXBJM
2. **Original source code** (for the demo hosted on Hugging Face): https://huggingface.co/spaces/bhaskartripathi/pdfChatter/blob/main/app.py
Please star this project if you like it!
### Docker
Run `docker-compose -f docker-compose.yaml up` to use it with Docker Compose.
## Use `pdfGPT` on Production using [langchain-serve](https://github.com/jina-ai/langchain-serve)
### Local playground
1. Run `lc-serve deploy local api` in one terminal to expose the app as an API using langchain-serve.
2. Run `python app.py` in another terminal for a local gradio playground.
3. Open `http://localhost:7860` in your browser and interact with the app.
### Cloud deployment
Make `pdfGPT` production-ready by deploying it on [Jina Cloud](https://cloud.jina.ai/).
Run `lc-serve deploy jcloud api`
<details>
<summary>Show command output</summary>
@@ -59,7 +61,7 @@ Make `pdfGPT` production ready by deploying it on [Jina Cloud](https://cloud.jin
</details>
### Interact using cURL
(Change the URL to your own endpoint)
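The same request can also be issued from Python. Here is a minimal sketch, assuming the service runs at `http://localhost:8080` and exposes the `/ask_url` route that `app.py` calls; the PDF URL and API key below are placeholders:
```python
# Minimal sketch of querying a running pdfGPT API from Python; the host,
# PDF URL, and API key below are placeholders, not real values.
import requests

payload = {
    "url": "https://example.com/paper.pdf",  # any publicly reachable PDF
    "question": "What is the main contribution of this paper?",
    "envs": {"OPENAI_API_KEY": "sk-..."},  # your own OpenAI key
}
r = requests.post("http://localhost:8080/ask_url", json=payload)
r.raise_for_status()
print(r.json()["result"])
```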
@@ -149,7 +151,7 @@ sequenceDiagram
System-->>User: Return Answer
```
## Flowchart
```mermaid
flowchart TB
A[Input] --> B[URL]
@@ -162,13 +164,11 @@ G -- K-Nearest Neighbour --> K[Get Nearest Neighbour - matching citation referen
K -- Generate Prompt --> H[Generate Answer]
H -- Output --> I[Output]
```
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=bhaskatripathi/pdfGPT&type=Date)](https://star-history.com/#bhaskatripathi/pdfGPT&Date)
I am looking for more contributors from the open source community who can take up backlog items voluntarily and maintain the application jointly with me.
Also try [Text2Diagram](https://github.com/bhaskatripathi/Text2Diagram), an app that creates schematic architecture diagrams, UML, flowcharts, Gantt charts, and more from natural language input.

@@ -1,10 +1,10 @@
# import libraries
import os
import re
import shutil
import urllib.request
from pathlib import Path
from tempfile import NamedTemporaryFile
import fitz
import numpy as np
import openai
@@ -13,6 +13,8 @@ from fastapi import UploadFile
from lcserve import serving
from sklearn.neighbors import NearestNeighbors
recommender = None

# download pdf from given url
@@ -21,21 +23,29 @@ def download_pdf(url, output_path):
    urllib.request.urlretrieve(url, output_path)
# preprocess text
def preprocess(text):
    text = text.replace("\n", " ")
    text = re.sub(r"\s+", " ", text)
    return text
# convert pdf to text list
def pdf_to_text(path, start_page=1, end_page=None):
    doc = fitz.open(path)
    total_pages = doc.page_count

    # if end page is not specified, set it to total pages
    if end_page is None:
        end_page = total_pages

    text_list = []

    # loop through all the pages and get the text
    for i in range(start_page - 1, end_page):
        text = doc.load_page(i).get_text("text")
        text = preprocess(text)
@@ -45,14 +55,19 @@ def pdf_to_text(path, start_page=1, end_page=None):
    return text_list
# convert text list to chunks of words with page numbers
def text_to_chunks(texts, word_length=150, start_page=1):
    text_toks = [t.split(" ") for t in texts]
    page_nums = []
    chunks = []

    # loop through each word and create chunks
    for idx, words in enumerate(text_toks):
        for i in range(0, len(words), word_length):
            chunk = words[i : i + word_length]
            # if the last chunk is smaller than word_length and this is not the last page, carry it over to the next page
            if (
                (i + word_length) > len(words)
                and (len(chunk) < word_length)
@@ -60,17 +75,21 @@ def text_to_chunks(texts, word_length=150, start_page=1):
            ):
                text_toks[idx + 1] = chunk + text_toks[idx + 1]
                continue

            chunk = " ".join(chunk).strip()
            chunk = f'[Page no. {idx + start_page}] "{chunk}"'
            chunks.append(chunk)

    return chunks
# semantic search class
class SemanticSearch:
    def __init__(self):
        self.use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
        self.fitted = False

    # fit the data
    def fit(self, data, batch=1000, n_neighbors=5):
        self.data = data
        self.embeddings = self.get_text_embedding(data, batch=batch)
@@ -79,23 +98,24 @@ class SemanticSearch:
        self.nn.fit(self.embeddings)
        self.fitted = True

    # call the model
    def __call__(self, text, return_data=True):
        inp_emb = self.use([text])
        neighbors = self.nn.kneighbors(inp_emb, return_distance=False)[0]
        return [self.data[i] for i in neighbors] if return_data else neighbors

    # get text embeddings in batches
    def get_text_embedding(self, texts, batch=1000):
        embeddings = []
        for i in range(0, len(texts), batch):
            text_batch = texts[i : (i + batch)]
            emb_batch = self.use(text_batch)
            embeddings.append(emb_batch)
        return np.vstack(embeddings)
# load recommender
def load_recommender(path, start_page=1):
@@ -106,7 +126,10 @@ def load_recommender(path, start_page=1):
    texts = pdf_to_text(path, start_page=start_page)
    chunks = text_to_chunks(texts, start_page=start_page)
    recommender.fit(chunks)
    return "Corpus Loaded."
# generate text using openAI
def generate_text(openAI_key, prompt, engine="text-davinci-003"):
@@ -119,16 +142,17 @@ def generate_text(openAI_key, prompt, engine="text-davinci-003"):
        stop=None,
        temperature=0.7,
    )
    return completions.choices[0].text
# generate answer for a given question
def generate_answer(question, openAI_key):
    topn_chunks = recommender(question)
    prompt = "search results:\n\n"
    for c in topn_chunks:
        prompt += c + "\n\n"

    prompt += (
        "Instructions: Compose a comprehensive reply to the query using the search results given. "
@@ -142,8 +166,15 @@ def generate_answer(question, openAI_key):
    )
    prompt += f"Query: {question}\nAnswer:"
    return generate_text(openAI_key, prompt, "text-davinci-003")
# global instance of semantic search
recommender = SemanticSearch()
# load openAI key
def load_openai_key() -> str:
@@ -155,14 +186,20 @@ def load_openai_key() -> str:
    return key
# ask url
@serving
def ask_url(url: str, question: str):
    download_pdf(url, "corpus.pdf")
    load_recommender("corpus.pdf")
    openAI_key = load_openai_key()
    return generate_answer(question, openAI_key)
# ask file
@serving
async def ask_file(file: UploadFile, question: str) -> str:
    suffix = Path(file.filename).suffix

app.py

@@ -1,94 +1,132 @@
import json

import gradio as gr
import requests


# send the user's question, plus a PDF URL or an uploaded file, to the
# langchain-serve backend and return the answer string
def ask_api(
    lcserve_host: str,
    url: str,
    file,
    question: str,
    openAI_key: str,
) -> str:
    # the API host must be a valid http(s) URL
    if not lcserve_host.startswith("http"):
        raise ValueError("Invalid API Host")

    # exactly one of URL and PDF must be provided
    if not any([url.strip(), file]):
        raise ValueError("Either URL or PDF should be provided.")
    if all([url.strip(), file]):
        raise ValueError("Both URL and PDF are provided. Please provide only one.")

    if not question.strip():
        raise ValueError("Question field is empty.")

    _data = {
        "question": question,
        "envs": {"OPENAI_API_KEY": openAI_key},
    }

    # POST to the ask_url route for a URL, or to ask_file with the uploaded file
    if url.strip():
        r = requests.post(f"{lcserve_host}/ask_url", json={"url": url, **_data})
    else:
        with open(file.name, "rb") as f:
            r = requests.post(
                f"{lcserve_host}/ask_file",
                params={"input_data": json.dumps(_data)},
                files={"file": f},
            )

    try:
        # raise an HTTPError if the request to the server failed
        r.raise_for_status()
    except requests.exceptions.HTTPError as e:
        raise ValueError(
            f"Request failed with status code {r.status_code}: {e}"
        ) from e

    # the answer lives under the "result" key of the JSON response
    return r.json()["result"]


title = "PDF GPT"
description = """PDF GPT allows you to chat with your PDF file using Universal Sentence Encoder and Open AI. It gives a more hallucination-free response than other tools, as the embeddings are better than OpenAI's. The returned response can even cite, in square brackets ([]), the page number where the information is located, adding credibility to the responses and helping to locate pertinent information quickly."""

# build the Gradio interface
with gr.Blocks() as demo:
    gr.Markdown(f"<center><h1>{title}</h1></center>")
    gr.Markdown(description)

    with gr.Row():
        # input fields
        with gr.Group():
            lcserve_host = gr.Textbox(
                label="Enter your API Host here",
                value="http://localhost:8080",
                placeholder="http://localhost:8080",
            )
            gr.Markdown(
                '<p style="text-align:center">Get your Open AI API key <a href="https://platform.openai.com/account/api-keys">here</a></p>'
            )
            openAI_key = gr.Textbox(
                label="Enter your OpenAI API key here", type="password"
            )
            pdf_url = gr.Textbox(label="Enter PDF URL here")
            gr.Markdown("<center><h4>OR<h4></center>")
            file = gr.File(
                label="Upload your PDF/ Research Paper / Book here", file_types=[".pdf"]
            )
            question = gr.Textbox(label="Enter your question here")
            btn = gr.Button(value="Submit")
            btn.style(full_width=True)

        # output area for the answer
        with gr.Group():
            answer = gr.Textbox(label="The answer to your question is :")

    # wire the button to the API call; Gradio passes the current component
    # values as inputs and writes the return value to the answer box
    btn.click(
        ask_api,
        inputs=[lcserve_host, pdf_url, file, question, openAI_key],
        outputs=[answer],
    )

demo.launch(server_port=7860)

@@ -1,8 +1,15 @@
# Machine learning libraries
numpy>=1.17.0
scikit-learn>=0.22.0
tensorflow-macos>=2.0.0,<3.0.0
tensorflow-hub>=0.9.0,<1.0.0

# PDF processing and OpenAI API libraries
PyMuPDF>=1.18.13
openai>=0.10.2,<0.11

# User interface library
gradio>=1.4.0

# API serving via langchain-serve
langchain-serve>=0.0.19
