Intro
For those unfamiliar with it, LangChain is a framework for developing applications that make use of LLMs.
As the name suggests, LangChain is based on the concept of an LLM Chain, which combines 3 elements (sketched in code right after this list):
- Prompt Templates: a reproducible way to generate a prompt. A template contains a text string (“the template”) that accepts a set of parameters from the end user and generates the final prompt passed as input to the model
- The language model (LLM): LangChain integrates with the most important providers (OpenAI, Cohere, Hugging Face, etc)
- Output Parsers: they allow you to extract structured data from the answers returned by the language model
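To make the idea concrete, here is a minimal sketch of a chain wiring a prompt template to an LLM (the output parser is omitted for brevity). It assumes an OpenAI API key is configured in the environment, and the template text is just an illustrative placeholder:

from langchain import OpenAI, PromptTemplate
from langchain.chains import LLMChain

# A reusable template: {text} is filled in by the end user at call time
prompt = PromptTemplate(
    template="Summarize the following in one sentence: {text}",
    input_variables=["text"],
)

# The chain combines the template with one of the supported LLM providers
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run(text="LangChain is a framework for developing LLM-powered applications..."))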
The framework has 2 very interesting features:
- it can extend LLM knowledge with your own data, leveraging both structured and unstructured datasets
- it provides “Agent” capabilities, in which the next action to take is itself an output returned by the LLM
I was quite curious about the first item, so I started making some tests. I don’t want to make a critical analysis of the model’s performance, but rather verify how easy it is to integrate the framework with one’s own data.
Integration with unstructured data
I didn’t know where to start, so I took a look at the most documented use cases on the internet. I found a lot of material related to parsing PDF files, so it seemed like an area with plenty of room to experiment.
In the official documentation there’s a special section related to the “Data Connection”, which I found incredibly clear and intuitive. I will try to summarize here the most important points.
The building blocks made available by LangChain are the following (a short sketch wiring them together follows the list):
- Document: it’s an abstraction containing both the data in textual form and the associated metadata
- Document loaders: classes that let you extract text and metadata from a specific type of data source in order to build the “Document”
- Document transformers: they are used to process Documents. Since LLMs usually have strict limits on available tokens, the most common transformation is chunk splitting, which makes it possible to submit calls to the LLM provider in series or in parallel. There are also other types of transformers, for example: redundancy reduction, translation, metadata extraction, etc.
- Text embedding: the operation of translating a portion of text into an N-dimensional vector. This is the core component for semantic search, which is based on similarity and implemented by computing vector distances across that N-dimensional space
- Vector stores: they store the embeddings inside a vector DB engine, which is capable of efficiently returning the vectors closest to the input text (and therefore the portions of text that are most similar). It’s possible to use some open source DB engines to run everything locally, or to integrate with market products that obviously offer much better performance (e.g. Pinecone)
- Retrievers: an interface that returns documents from an unstructured query. It’s a slightly more general concept than a Vector Store but, unlike the latter, it only returns documents and doesn’t necessarily store them
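Here is the promised sketch of how these blocks fit together; the file path and the query are hypothetical, it assumes an OpenAI API key is configured, and it uses Chroma (which requires the chromadb package) as a local open source vector store:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Document loader: extract text and metadata into Documents
docs = PyPDFLoader("docs/pdf/some_paper.pdf").load()

# Document transformer: split into chunks that fit within the LLM token limits
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Text embedding + vector store: index the chunks for similarity search
db = Chroma.from_documents(chunks, OpenAIEmbeddings())

# Retriever: return the chunks most similar to an unstructured query
retriever = db.as_retriever(search_kwargs={"k": 4})
relevant = retriever.get_relevant_documents("Which optimization model does the paper use?")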
Chains
So let’s talk about the main components: the chains.
LangChain introduces this concept as a useful abstraction for implementing applications that use LLMs in a simple, modular way. There are many predefined Chains; the most common are listed below, followed by a short example:
- RetrievalQA: it answers the user’s input based on the documents returned by a retriever
- ConversationalRetrievalChain: it’s similar to RetrievalQA. It adds the capability to build a conversational experience through the history of exchanged messages
- Summarize: as the name suggests, it enables text summarization
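As an example, a RetrievalQA chain can be assembled on top of the retriever built in the previous sketch in just a few lines (the question is again hypothetical):

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# chain_type="stuff" inserts the retrieved chunks directly into the prompt
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
)
print(qa.run("What are the main findings of the paper?"))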
The experiment
I took a 2017 research paper, written by some researchers at the Oak Ridge National Laboratory (ORNL) and other university institutes, which proposes an implementation of a quantum computing algorithm for a Portfolio Optimization problem.
In particular, the article describes the advantages of using a QUBO (Quadratic Unconstrained Binary Optimization) variant of the Markowitz model on D-Wave quantum devices.
The complete article can be found at this link.
I’m passionate about these topics but lack a solid theoretical grounding: I can follow the main points of the paper, but I’m not competent to evaluate the reliability or soundness of its results. So I decided to ask OpenAI for a critical analysis, going through LangChain.
Surprisingly, it only took me a few hours and less than 20 lines of code to get a working prototype with an overall good result.
The code
Here you can find the source code. It’s almost self-describing, but I’m adding some further notes and comments below.
from langchain import OpenAI, PromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.chains.summarize import load_summarize_chain
from dotenv import load_dotenv

# Load the OPENAI_API_KEY environment variable from the .env file
load_dotenv()

# Extract the text and metadata of the paper into Documents
loader = PyPDFLoader("docs/pdf/102.pdf")
docs = loader.load()

# The same prompt is used for both the map and the combine steps
prompt_template = """Write a negative critique of this research article, questioning its findings and applicability:
{text}
CRITICS:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

# map_reduce: critique each chunk separately, then merge the partial critiques
chain = load_summarize_chain(
    OpenAI(temperature=0.5),
    chain_type="map_reduce",
    return_intermediate_steps=False,
    map_prompt=PROMPT,
    combine_prompt=PROMPT,
)
chain({"input_documents": docs}, return_only_outputs=True)
The call to load_dotenv() loads the OPENAI_API_KEY environment variable associated with my OpenAI account.
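For reference, the .env file sitting next to the script is a single line (key redacted):

OPENAI_API_KEY=sk-...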
The first instructions are used to initialize the loader and create the associated “Documents”. Right after that, I define the prompt template and call the “Summarize” chain.
The first parameter passed to the chain is an instance of the OpenAI LLM, which I deliberately configured with a temperature of 0.5 to push the creativity up a little. The chain_type is “map_reduce” because the text is long enough to exceed the 4096-token limit imposed by the provider I’m using (GPT-3.5).
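If you want to verify that the paper really blows past that limit, a quick way is to count its tokens, for example with the tiktoken library (this check is my own addition, it’s not part of the script above):

import tiktoken

# cl100k_base is the encoding used by the GPT-3.5 model family
enc = tiktoken.get_encoding("cl100k_base")
total = sum(len(enc.encode(d.page_content)) for d in docs)
print(total)  # well above 4096, so a single call can't hold the whole text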
Outcomes
Please find below a couple of outcomes I got from the above code:
This research article presents a variety of methodologies for analyzing portfolio selection in the presence of multiple criteria. However, the results are based on mathematical theories and simulation models rather than real data. Also, using a fixed $100 budget is unrealistic and does not represent real-world conditions. Furthermore, most of the techniques described in this article are not applicable to real financial markets, as they do not take into account factors such as market volatility, economic cycles, etc. Consequently, the findings and applicability of this research article are questioned.
This research article does not provide sufficient evidence of the results obtained, nor does it demonstrate the applicability of the proposed model in practice. Markowitz’s theory of portfolio selection is not applicable to all investors, as it does not consider the complications that can arise in practice, such as taxes or fees. Also, the use of the D-Wave quantum optimizer has not been adequately demonstrated. The relaxed wording does not involve buying shares, which means that the optimal outcome is not realistic. Also, the abundance of zeros in the data is unrealistic. Most of the sample code and applications mentioned in the article are computation related
Conclusions
As I said earlier, I can’t claim any strong mastery of the subject to evaluate whether the outcomes I got from the LLM have solid foundations, but at first sight they seem realistic and substantial enough to prompt some thoughts.
For example, take the comment about the investment of only $100: this is indeed a simplified scenario considered in the paper, but to be honest I have no idea whether this factor can effectively call the results into question.
In general, the thing that amazed me is how easily the framework provides the building blocks for developing AI applications, without reinventing the wheel, while integrating very well with the main providers and market products.
I realize the example shown is really trivial, but it opens up a world of possibilities. I’m running other tests, expanding the dataset and trying to answer slightly more complex questions. Stay tuned!