
[Improvements] Add support for creating app from YAML string config (#980)

Deven Patel, 1 year ago
commit 406c46e7f4

+ 121 - 20
docs/advanced/configuration.mdx

@@ -6,15 +6,16 @@ Embedchain is made to work out of the box. However, for advanced users we're als
 
 You can configure different components of your app (`llm`, `embedding model`, or `vector database`) through a simple yaml configuration that Embedchain offers. Here is a generic full-stack example of the yaml config:
 
-```yaml
+
+<Tip>
+Embedchain applications are configurable using a YAML file, a JSON file, or by directly passing a config dictionary.
+</Tip>
+
+<CodeGroup>
+```yaml config.yaml
 app:
   config:
-    id: 'full-stack-app'
-
-chunker:
-  chunk_size: 100
-  chunk_overlap: 20
-  length_function: 'len'
+    name: 'full-stack-app'
 
 llm:
   provider: openai
@@ -47,38 +48,138 @@ embedder:
   provider: openai
   config:
     model: 'text-embedding-ada-002'
+
+chunker:
+  chunk_size: 2000
+  chunk_overlap: 100
+  length_function: 'len'
+```
+
+```json config.json
+{
+  "app": {
+    "config": {
+      "name": "full-stack-app"
+    }
+  },
+  "llm": {
+    "provider": "openai",
+    "config": {
+      "model": "gpt-3.5-turbo",
+      "temperature": 0.5,
+      "max_tokens": 1000,
+      "top_p": 1,
+      "stream": false,
+      "template": "Use the following pieces of context to answer the query at the end.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n$context\n\nQuery: $query\n\nHelpful Answer:",
+      "system_prompt": "Act as William Shakespeare. Answer the following questions in the style of William Shakespeare."
+    }
+  },
+  "vectordb": {
+    "provider": "chroma",
+    "config": {
+      "collection_name": "full-stack-app",
+      "dir": "db",
+      "allow_reset": true
+    }
+  },
+  "embedder": {
+    "provider": "openai",
+    "config": {
+      "model": "text-embedding-ada-002"
+    }
+  },
+  "chunker": {
+    "chunk_size": 2000,
+    "chunk_overlap": 100,
+    "length_function": "len"
+  }
+}
 ```
 
+```python config.py
+config = {
+    'app': {
+        'config': {
+            'name': 'full-stack-app'
+        }
+    },
+    'llm': {
+        'provider': 'openai',
+        'config': {
+            'model': 'gpt-3.5-turbo',
+            'temperature': 0.5,
+            'max_tokens': 1000,
+            'top_p': 1,
+            'stream': False,
+            'template': (
+                "Use the following pieces of context to answer the query at the end.\n"
+                "If you don't know the answer, just say that you don't know, don't try to make up an answer.\n"
+                "$context\n\nQuery: $query\n\nHelpful Answer:"
+            ),
+            'system_prompt': (
+                "Act as William Shakespeare. Answer the following questions in the style of William Shakespeare."
+            )
+        }
+    },
+    'vectordb': {
+        'provider': 'chroma',
+        'config': {
+            'collection_name': 'full-stack-app',
+            'dir': 'db',
+            'allow_reset': True
+        }
+    },
+    'embedder': {
+        'provider': 'openai',
+        'config': {
+            'model': 'text-embedding-ada-002'
+        }
+    },
+    'chunker': {
+        'chunk_size': 2000,
+        'chunk_overlap': 100,
+        'length_function': 'len'
+    }
+}
+```
+</CodeGroup>
+
 Alright, let's dive into what each key means in the yaml config above:
 
 1. `app` Section:
     - `config`:
-        - `id` (String): The ID or name of your full-stack application.
-2. `chunker` Section:
-    - `chunk_size` (Integer): The size of each chunk of text that is sent to the language model.
-    - `chunk_overlap` (Integer): The amount of overlap between each chunk of text.
-    - `length_function` (String): The function used to calculate the length of each chunk of text. In this case, it's set to 'len'. You can also use any function import directly as a string here.
-3. `llm` Section:
+        - `name` (String): The name of your full-stack application.
+        - `id` (String): The id of your full-stack application.
+        <Note>Only use this to reload an already created app. We recommend that users do not create their own ids.</Note>
+        - `collect_metrics` (Boolean): Indicates whether metrics should be collected for the app, defaults to `True`.
+        - `log_level` (String): The log level for the app, defaults to `WARNING`.
+2. `llm` Section:
     - `provider` (String): The provider for the language model, which is set to 'openai'. You can find the full list of llm providers in [our docs](/components/llms).
-    - `model` (String): The specific model being used, 'gpt-3.5-turbo'.
     - `config`:
+        - `model` (String): The specific model being used, 'gpt-3.5-turbo'.
         - `temperature` (Float): Controls the randomness of the model's output. A higher value (closer to 1) makes the output more random.
         - `max_tokens` (Integer): Controls how many tokens are used in the response.
         - `top_p` (Float): Controls the diversity of word selection. A higher value (closer to 1) makes word selection more diverse.
         - `stream` (Boolean): Controls if the response is streamed back to the user (set to false).
         - `template` (String): A custom template for the prompt that the model uses to generate responses.
         - `system_prompt` (String): A system prompt for the model to follow when generating responses, in this case, it's set to the style of William Shakespeare.
-4. `vectordb` Section:
+        - `number_documents` (Integer): Number of documents to pull from the vectordb as context, defaults to 1.
+3. `vectordb` Section:
     - `provider` (String): The provider for the vector database, set to 'chroma'. You can find the full list of vector database providers in [our docs](/components/vector-databases).
     - `config`:
-        - `collection_name` (String): The initial collection name for the database, set to 'full-stack-app'.
-        - `dir` (String): The directory for the database, set to 'db'.
-        - `allow_reset` (Boolean): Indicates whether resetting the database is allowed, set to true.
-5. `embedder` Section:
+        - `collection_name` (String): The initial collection name for the vectordb, set to 'full-stack-app'.
+        - `dir` (String): The directory for the local database, set to 'db'.
+        - `allow_reset` (Boolean): Indicates whether resetting the vectordb is allowed, set to true.
+    <Note>We recommend you check out the vectordb-specific config [here](https://docs.embedchain.ai/components/vector-databases).</Note>
+4. `embedder` Section:
     - `provider` (String): The provider for the embedder, set to 'openai'. You can find the full list of embedding model providers in [our docs](/components/embedding-models).
     - `config`:
         - `model` (String): The specific model used for text embedding, 'text-embedding-ada-002'.
-
+5. `chunker` Section:
+    - `chunk_size` (Integer): The size of each chunk of text that is sent to the language model.
+    - `chunk_overlap` (Integer): The amount of overlap between each chunk of text.
+    - `length_function` (String): The function used to calculate the length of each chunk of text. In this case, it's set to 'len'. You can also pass the import path of any function as a string here.
+
 If you have questions about the configuration above, please feel free to reach out to us using one of the following methods:
 
 <Snippet file="get-help.mdx" />
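To see the new loading options in action, here is a minimal sketch (assuming the `config.yaml`, `config.json`, and config dictionary shown above) of creating an app from each format with `from_config`:

```python
from embedchain import Pipeline as App

# Load from a YAML file
app = App.from_config(config_path="config.yaml")

# Load from a JSON file; the file extension selects the parser
app = App.from_config(config_path="config.json")

# Pass a config dictionary directly
config = {"app": {"config": {"name": "full-stack-app"}}}
app = App.from_config(config=config)
```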

+ 5 - 5
docs/components/embedding-models.mdx

@@ -29,7 +29,7 @@ from embedchain import Pipeline as App
 os.environ['OPENAI_API_KEY'] = 'xxx'
 
 # load embedding model configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 
 app.add("https://en.wikipedia.org/wiki/OpenAI")
 app.query("What is OpenAI?")
@@ -59,7 +59,7 @@ os.environ["AZURE_OPENAI_ENDPOINT"] = "https://xxx.openai.azure.com/"
 os.environ["AZURE_OPENAI_API_KEY"] = "xxx"
 os.environ["OPENAI_API_VERSION"] = "xxx"
 
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -93,7 +93,7 @@ GPT4All supports generating high quality embeddings of arbitrary length document
 from embedchain import Pipeline as App
 
 # load embedding model configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -122,7 +122,7 @@ Hugging Face supports generating embeddings of arbitrary length documents of tex
 from embedchain import Pipeline as App
 
 # load embedding model configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -153,7 +153,7 @@ Embedchain supports Google's VertexAI embeddings model through a simple interfac
 from embedchain import Pipeline as App
 
 # load embedding model configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml

+ 9 - 9
docs/components/llms.mdx

@@ -46,7 +46,7 @@ from embedchain import Pipeline as App
 os.environ['OPENAI_API_KEY'] = 'xxx'
 
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -78,7 +78,7 @@ os.environ["OPENAI_API_BASE"] = "https://xxx.openai.azure.com/"
 os.environ["OPENAI_API_KEY"] = "xxx"
 os.environ["OPENAI_API_VERSION"] = "xxx"
 
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -115,7 +115,7 @@ from embedchain import Pipeline as App
 os.environ["ANTHROPIC_API_KEY"] = "xxx"
 
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -152,7 +152,7 @@ from embedchain import Pipeline as App
 os.environ["COHERE_API_KEY"] = "xxx"
 
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -183,7 +183,7 @@ GPT4all is a free-to-use, locally running, privacy-aware chatbot. No GPU or inte
 from embedchain import Pipeline as App
 
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -216,7 +216,7 @@ from embedchain import Pipeline as App
 
 os.environ["JINACHAT_API_KEY"] = "xxx"
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -253,7 +253,7 @@ from embedchain import Pipeline as App
 os.environ["HUGGINGFACE_ACCESS_TOKEN"] = "xxx"
 
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -283,7 +283,7 @@ from embedchain import Pipeline as App
 os.environ["REPLICATE_API_TOKEN"] = "xxx"
 
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -308,7 +308,7 @@ Setup Google Cloud Platform application credentials by following the instruction
 from embedchain import Pipeline as App
 
 # load llm configuration from config.yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml

+ 12 - 9
docs/components/vector-databases.mdx

@@ -25,7 +25,7 @@ Utilizing a vector database alongside Embedchain is a seamless process. All you
 from embedchain import Pipeline as App
 
 # load chroma configuration from yaml file
-app = App.from_config(yaml_path="config1.yaml")
+app = App.from_config(config_path="config1.yaml")
 ```
 
 ```yaml config1.yaml
@@ -64,7 +64,7 @@ pip install --upgrade 'embedchain[elasticsearch]'
 from embedchain import Pipeline as App
 
 # load elasticsearch configuration from yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -73,8 +73,11 @@ vectordb:
   config:
     collection_name: 'es-index'
     es_url: http://localhost:9200
-    allow_reset: true
+    http_auth:
+      - admin
+      - admin
     api_key: xxx
+    verify_certs: false
 ```
 </CodeGroup>
 
@@ -92,19 +95,19 @@ pip install --upgrade 'embedchain[opensearch]'
 from embedchain import Pipeline as App
 
 # load opensearch configuration from yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
 vectordb:
   provider: opensearch
   config:
+    collection_name: 'my-app'
     opensearch_url: 'https://localhost:9200'
     http_auth:
       - admin
       - admin
     vector_dimension: 1536
-    collection_name: 'my-app'
     use_ssl: false
     verify_certs: false
 ```
@@ -131,7 +134,7 @@ os.environ['ZILLIZ_CLOUD_URI'] = 'https://xxx.zillizcloud.com'
 os.environ['ZILLIZ_CLOUD_TOKEN'] = 'xxx'
 
 # load zilliz configuration from yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -167,7 +170,7 @@ In order to use Pinecone as vector database, set the environment variables `PINE
 from embedchain import Pipeline as App
 
 # load pinecone configuration from yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -190,7 +193,7 @@ In order to use Qdrant as a vector database, set the environment variables `QDRA
 from embedchain import Pipeline as App
 
 # load qdrant configuration from yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml
@@ -210,7 +213,7 @@ In order to use Weaviate as a vector database, set the environment variables `WE
 from embedchain import Pipeline as App
 
 # load weaviate configuration from yaml file
-app = App.from_config(yaml_path="config.yaml")
+app = App.from_config(config_path="config.yaml")
 ```
 
 ```yaml config.yaml

+ 12 - 0
docs/data-sources/data-type-handling.mdx

@@ -50,3 +50,15 @@ from embedchain import Pipeline as App
 naval_chat_bot = App()
 print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
 ```
+
+## Resetting an app and vector database
+
+You can reset the app by simply calling the `reset` method. This will delete the vector database and all other app-related files.
+
+```python
+from embedchain import Pipeline as App
+
+app = App()
+app.add("https://www.youtube.com/watch?v=3qHkcs3kG44")
+app.reset()
+```

+ 1 - 1
docs/data-sources/docs-site.mdx

@@ -1,5 +1,5 @@
 ---
-title: '📚🌐 Code documentation'
+title: '📚 Code documentation'
 ---
 
 To add any code documentation website as a loader, use the data_type as `docs_site`. Eg:

+ 6 - 6
docs/data-sources/overview.mdx

@@ -5,20 +5,20 @@ title: Overview
 Embedchain comes with built-in support for various data sources. We handle the complexity of loading unstructured data from these data sources, allowing you to easily customize your app through a user-friendly interface.
 
 <CardGroup cols={4}>
-  <Card title="📊 csv" href="/data-sources/csv"></Card>
+  <Card title="📊 CSV" href="/data-sources/csv"></Card>
   <Card title="📃 JSON" href="/data-sources/json"></Card>
-  <Card title="📚🌐 docs site" href="/data-sources/docs-site"></Card>
+  <Card title="📚 docs site" href="/data-sources/docs-site"></Card>
   <Card title="📄 docx" href="/data-sources/docx"></Card>
   <Card title="📝 mdx" href="/data-sources/mdx"></Card>
-  <Card title="📓 notion" href="/data-sources/notion"></Card>
-  <Card title="📰 pdf" href="/data-sources/pdf-file"></Card>
+  <Card title="📓 Notion" href="/data-sources/notion"></Card>
+  <Card title="📰 PDF" href="/data-sources/pdf-file"></Card>
   <Card title="❓💬 q&a pair" href="/data-sources/qna"></Card>
   <Card title="🗺️ sitemap" href="/data-sources/sitemap"></Card>
   <Card title="📝 text" href="/data-sources/text"></Card>
-  <Card title="🌐📄 web page" href="/data-sources/web-page"></Card>
+  <Card title="🌐 web page" href="/data-sources/web-page"></Card>
   <Card title="🧾 xml" href="/data-sources/xml"></Card>
   <Card title="🙌 OpenAPI" href="/data-sources/openapi"></Card>
-  <Card title="📺 youtube video" href="/data-sources/youtube-video"></Card>
+  <Card title="📺 Youtube" href="/data-sources/youtube-video"></Card>
   <Card title="📬 Gmail" href="/data-sources/gmail"></Card>
   <Card title="🐘 Postgres" href="/data-sources/postgres"></Card>
   <Card title="🐬 MySQL" href="/data-sources/mysql"></Card>

+ 1 - 1
docs/data-sources/web-page.mdx

@@ -1,5 +1,5 @@
 ---
-title: '🌐📄 Web page'
+title: '🌐 Web page'
 ---
 
 To add any web page, use the data_type as `web_page`. Eg:

+ 1 - 1
docs/data-sources/youtube-video.mdx

@@ -1,5 +1,5 @@
 ---
-title: '📺 Youtube video'
+title: '📺 YouTube'
 ---
 
 

+ 8 - 1
docs/get-started/examples.mdx

@@ -1,8 +1,15 @@
 ---
 title: 🔎 Examples
-description: 'Collection of Google colab notebook and Replit links for users'
 ---
 
+# Explore awesome apps
+
+Check out the remarkable work accomplished using [Embedchain](https://app.embedchain.ai/custom-gpts/).
+
+## Collection of Google Colab notebooks and Replit links for users
+
+Get started with Embedchain by trying out the examples below. You can run the examples in your browser using Google Colab or Replit.
+
 <table>
   <thead>
     <tr>

+ 37 - 16
docs/get-started/faq.mdx

@@ -2,13 +2,36 @@
 title: ❓ FAQs
 description: 'Collections of all the frequently asked questions'
 ---
-
-#### Does Embedchain support OpenAI's Assistant APIs?
-
+<AccordionGroup>
+<Accordion title="Does Embedchain support OpenAI's Assistant APIs?">
 Yes, it does. Please refer to the [OpenAI Assistant docs page](/get-started/openai-assistant).
+</Accordion>
+<Accordion title="How to use MistralAI language model?">
+Use the model available on Hugging Face: `mistralai/Mistral-7B-v0.1`.
+<CodeGroup>
+```python main.py
+import os
+from embedchain import Pipeline as App
 
-#### How to use `gpt-4-turbo` model released on OpenAI DevDay?
+os.environ["OPENAI_API_KEY"] = "sk-xxx"
+os.environ["HUGGINGFACE_ACCESS_TOKEN"] = "hf_your_token"
 
+app = App.from_config("huggingface.yaml")
+```
+```yaml huggingface.yaml
+llm:
+  provider: huggingface
+  config:
+    model: 'mistralai/Mistral-7B-v0.1'
+    temperature: 0.5
+    max_tokens: 1000
+    top_p: 0.5
+    stream: false
+```
+</CodeGroup>
+</Accordion>
+<Accordion title="How to use ChatGPT 4 turbo model released on OpenAI DevDay?">
+Use the model `gpt-4-turbo` provided my openai.
 <CodeGroup>
 
 ```python main.py
@@ -18,7 +41,7 @@ from embedchain import Pipeline as App
 os.environ['OPENAI_API_KEY'] = 'xxx'
 
 # load llm configuration from gpt4_turbo.yaml file
-app = App.from_config(yaml_path="gpt4_turbo.yaml")
+app = App.from_config(config_path="gpt4_turbo.yaml")
 ```
 
 ```yaml gpt4_turbo.yaml
@@ -31,12 +54,9 @@ llm:
     top_p: 1
     stream: false
 ```
-
 </CodeGroup>
-
-
-#### How to use GPT-4 as the LLM model?
-
+</Accordion>
+<Accordion title="How to use GPT-4 as the LLM model?">
 <CodeGroup>
 
 ```python main.py
@@ -46,7 +66,7 @@ from embedchain import Pipeline as App
 os.environ['OPENAI_API_KEY'] = 'xxx'
 
 # load llm configuration from gpt4.yaml file
-app = App.from_config(yaml_path="gpt4.yaml")
+app = App.from_config(config_path="gpt4.yaml")
 ```
 
 ```yaml gpt4.yaml
@@ -61,9 +81,8 @@ llm:
 ```
 
 </CodeGroup>
-
-#### I don't have OpenAI credits. How can I use some open source model?
-
+</Accordion>
+<Accordion title="I don't have OpenAI credits. How can I use some open source model?">
 <CodeGroup>
 
 ```python main.py
@@ -73,7 +92,7 @@ from embedchain import Pipeline as App
 os.environ['OPENAI_API_KEY'] = 'xxx'
 
 # load llm configuration from opensource.yaml file
-app = App.from_config(yaml_path="opensource.yaml")
+app = App.from_config(config_path="opensource.yaml")
 ```
 
 ```yaml opensource.yaml
@@ -93,8 +112,10 @@ embedder:
 ```
 </CodeGroup>
 
-#### How to contact support?
+</Accordion>
+</AccordionGroup>
 
+#### Need more help?
 If docs aren't sufficient, please feel free to reach out to us using one of the following methods:
 
 <Snippet file="get-help.mdx" />

+ 3 - 1
docs/get-started/introduction.mdx

@@ -105,7 +105,7 @@ app.deploy()
 # ✅ Data of type: web_page, value: https://www.forbes.com/profile/elon-musk added successfully.
 ```
 
-## 🚀 How it works?
+## 🛠️ How it works?
 
 Embedchain abstracts out the following steps from you to easily create LLM powered apps:
 
@@ -129,3 +129,5 @@ The process of loading the dataset and querying involves multiple steps, each wi
 - How should I find similar documents for a query? Which ranking model should I use?
 
 Embedchain takes care of all these nuances and provides a simple interface to create apps on any data.
+
+## [🚀 Get started](https://docs.embedchain.ai/get-started/quickstart)

+ 54 - 60
docs/get-started/quickstart.mdx

@@ -12,79 +12,73 @@ pip install embedchain
 ```
 
 <Tip>
-Embedchain now supports OpenAI's latest `gpt-4-turbo` model. Checkout the [docs here](/get-started/faq#how-to-use-gpt-4-turbo-model-released-on-openai-devday) on how to use it.
+Embedchain now supports OpenAI's latest `gpt-4-turbo` model. Check out the [FAQs](/get-started/faq#how-to-use-gpt-4-turbo-model-released-on-openai-devday).
 </Tip>
 
 Creating an app involves 3 steps:
 
 <Steps>
   <Step title="⚙️ Import app instance">
-```python
-from embedchain import Pipeline as App
-app = App()
-```
+    ```python
+    from embedchain import Pipeline as App
+    app = App()
+    ```
+    <Accordion title="Customize your app by a simple YAML config" icon="gear-complex">
+      Embedchain provides a wide range of options to customize your app. You can customize the model, data sources, and much more.
+      Explore the custom configurations [here](https://docs.embedchain.ai/advanced/configuration).
+      ```python
+      from embedchain import Pipeline as App
+      app = App.from_config(config_path="config.yaml")
+      ```
+    </Accordion>
   </Step>
   <Step title="🗃️ Add data sources">
-```python
-# Add different data sources
-app.add("https://en.wikipedia.org/wiki/Elon_Musk")
-app.add("https://www.forbes.com/profile/elon-musk")
-# You can also add local data sources such as pdf, csv files etc.
-# app.add("/path/to/file.pdf")
-```
+    ```python
+    app.add("https://en.wikipedia.org/wiki/Elon_Musk")
+    app.add("https://www.forbes.com/profile/elon-musk")
+    # app.add("path/to/file/elon_musk.pdf")
+    ```
+    <Accordion title="Embedchain supports adding data from many data sources." icon="files">
+      Embedchain supports adding data from many data sources including web pages, PDFs, databases, and more.
+      Explore the list of supported [data sources](https://docs.embedchain.ai/data-sources/overview).
+    </Accordion>
   </Step>
-  <Step title="💬 Query or chat or search context on your data">
-```python
-app.query("What is the net worth of Elon Musk today?")
-# Answer: The net worth of Elon Musk today is $258.7 billion.
-```
+  <Step title="💬 Ask questions, chat, or search through your data with ease">
+    ```python
+    app.query("What is the net worth of Elon Musk today?")
+    # Answer: The net worth of Elon Musk today is $258.7 billion.
+    ```
+    <Accordion title="Want to chat with your app?" icon="face-thinking">
+      Embedchain provides a wide range of features to interact with your app. You can chat with your app, ask questions, search through your data, and much more.
+      ```python
+      app.chat("How many companies does Elon Musk run? Name those")
+      # Answer: Elon Musk runs 3 companies: Tesla, SpaceX, and Neuralink.
+      app.chat("What is his net worth today?")
+      # Answer: The net worth of Elon Musk today is $258.7 billion.
+      ```
+      To learn about other features, click [here](https://docs.embedchain.ai/get-started/introduction).
+    </Accordion>
   </Step>
-  <Step title="🚀 (Optional) Deploy your pipeline to Embedchain Platform">
-```python
-app.deploy()
-# 🔑 Enter your Embedchain API key. You can find the API key at https://app.embedchain.ai/settings/keys/
-# ec-xxxxxx
-
-# 🛠️ Creating pipeline on the platform...
-# 🎉🎉🎉 Pipeline created successfully! View your pipeline: https://app.embedchain.ai/pipelines/xxxxx
-
-# 🛠️ Adding data to your pipeline...
-# ✅ Data of type: web_page, value: https://www.forbes.com/profile/elon-musk added successfully.
-```
+  <Step title="🚀 Seamlessly launch your App on the Embedchain Platform!">
+    ```python
+    app.deploy()
+    # 🔑 Enter your Embedchain API key. You can find the API key at https://app.embedchain.ai/settings/keys/
+    # ec-xxxxxx
+
+    # 🛠️ Creating pipeline on the platform...
+    # 🎉🎉🎉 Pipeline created successfully! View your pipeline: https://app.embedchain.ai/pipelines/xxxxx
+
+    # 🛠️ Adding data to your pipeline...
+    # ✅ Data of type: web_page, value: https://www.forbes.com/profile/elon-musk added successfully.
+    ```
+    <Accordion title="Share your app with others" icon="laptop-mobile">
+      You can now share your app with others from our platform.
+      Access your app on our [platform](https://app.embedchain.ai/).
+    </Accordion>
   </Step>
 </Steps>
 
-Putting it together, you can run your first app using the following code. Make sure to set the `OPENAI_API_KEY` 🔑 environment variable in the code.
-
-```python
-import os
-from embedchain import Pipeline as App
-
-os.environ["OPENAI_API_KEY"] = "xxx"
-app = App()
-
-# Add different data sources
-app.add("https://en.wikipedia.org/wiki/Elon_Musk")
-app.add("https://www.forbes.com/profile/elon-musk")
-# You can also add local data sources such as pdf, csv files etc.
-# app.add("/path/to/file.pdf")
-
-response = app.query("What is the net worth of Elon Musk today?")
-print(response)
-# Answer: The net worth of Elon Musk today is $258.7 billion.
-
-app.deploy()
-# 🔑 Enter your Embedchain API key. You can find the API key at https://app.embedchain.ai/settings/keys/
-# ec-xxxxxx
-
-# 🛠️ Creating pipeline on the platform...
-# 🎉🎉🎉 Pipeline created successfully! View your pipeline: https://app.embedchain.ai/pipelines/xxxxx
-
-# 🛠️ Adding data to your pipeline...
-# ✅ Data of type: web_page, value: https://www.forbes.com/profile/elon-musk added successfully.
-```
-
-You can try it out yourself using the following Google Colab notebook:
+Putting it together, you can run your first app using the following Google Colab notebook. Make sure to set the `OPENAI_API_KEY` 🔑 environment variable in the code.
 
 <a href="https://colab.research.google.com/drive/17ON1LPonnXAtLaZEebnOktstB_1cJJmh?usp=sharing">
     <img src="https://camo.githubusercontent.com/84f0493939e0c4de4e6dbe113251b4bfb5353e57134ffd9fcab6b8714514d4d1/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open in Colab" />

+ 2 - 2
embedchain/apps/app.py

@@ -12,7 +12,7 @@ from embedchain.factory import EmbedderFactory, LlmFactory, VectorDBFactory
 from embedchain.helpers.json_serializable import register_deserializable
 from embedchain.llm.base import BaseLlm
 from embedchain.llm.openai import OpenAILlm
-from embedchain.utils import validate_yaml_config
+from embedchain.utils import validate_config
 from embedchain.vectordb.base import BaseVectorDB
 from embedchain.vectordb.chroma import ChromaDB
 
@@ -134,7 +134,7 @@ class App(EmbedChain):
             config_data = yaml.safe_load(file)
 
         try:
-            validate_yaml_config(config_data)
+            validate_config(config_data)
         except Exception as e:
             raise Exception(f"❌ Error occurred while validating the YAML config. Error: {str(e)}")
 

+ 4 - 1
embedchain/llm/huggingface.py

@@ -1,4 +1,5 @@
 import importlib
+import logging
 import os
 from typing import Optional
 
@@ -42,9 +43,11 @@ class HuggingFaceLlm(BaseLlm):
         else:
             raise ValueError("`top_p` must be > 0.0 and < 1.0")
 
+        model = config.model or "google/flan-t5-xxl"
+        logging.info(f"Using HuggingFaceHub with model {model}")
         llm = HuggingFaceHub(
             huggingfacehub_api_token=os.environ["HUGGINGFACE_ACCESS_TOKEN"],
-            repo_id=config.model or "google/flan-t5-xxl",
+            repo_id=model,
             model_kwargs=model_kwargs,
         )
 

+ 53 - 24
embedchain/pipeline.py

@@ -4,6 +4,7 @@ import logging
 import os
 import sqlite3
 import uuid
+from typing import Any, Dict, Optional
 
 import requests
 import yaml
@@ -19,7 +20,7 @@ from embedchain.helpers.json_serializable import register_deserializable
 from embedchain.llm.base import BaseLlm
 from embedchain.llm.openai import OpenAILlm
 from embedchain.telemetry.posthog import AnonymousTelemetry
-from embedchain.utils import validate_yaml_config
+from embedchain.utils import validate_config
 from embedchain.vectordb.base import BaseVectorDB
 from embedchain.vectordb.chroma import ChromaDB
 
@@ -43,7 +44,7 @@ class Pipeline(EmbedChain):
         db: BaseVectorDB = None,
         embedding_model: BaseEmbedder = None,
         llm: BaseLlm = None,
-        yaml_path: str = None,
+        config_data: dict = None,
         log_level=logging.WARN,
         auto_deploy: bool = False,
         chunker: ChunkerConfig = None,
@@ -59,15 +60,15 @@ class Pipeline(EmbedChain):
         :type embedding_model: BaseEmbedder, optional
         :param llm: The LLM model used to calculate embeddings, defaults to None
         :type llm: BaseLlm, optional
-        :param yaml_path: Path to the YAML configuration file, defaults to None
-        :type yaml_path: str, optional
+        :param config_data: Config dictionary, defaults to None
+        :type config_data: dict, optional
         :param log_level: Log level to use, defaults to logging.WARN
         :type log_level: int, optional
         :param auto_deploy: Whether to deploy the pipeline automatically, defaults to False
         :type auto_deploy: bool, optional
         :raises Exception: If an error occurs while creating the pipeline
         """
-        if id and yaml_path:
+        if id and config_data:
             raise Exception("Cannot provide both id and config. Please provide only one of them.")
 
         if id and name:
@@ -79,8 +80,8 @@ class Pipeline(EmbedChain):
         logging.basicConfig(level=log_level, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
         self.logger = logging.getLogger(__name__)
         self.auto_deploy = auto_deploy
-        # Store the yaml config as an attribute to be able to send it
-        self.yaml_config = None
+        # Store the dict config as an attribute to be able to send it
+        self.config_data = config_data if (config_data and validate_config(config_data)) else None
         self.client = None
         # pipeline_id from the backend
         self.id = None
@@ -92,11 +93,6 @@ class Pipeline(EmbedChain):
         self.name = self.config.name
         self.config.id = self.local_id = str(uuid.uuid4()) if self.config.id is None else self.config.id
 
-        if yaml_path:
-            with open(yaml_path, "r") as file:
-                config_data = yaml.safe_load(file)
-                self.yaml_config = config_data
-
         if id is not None:
             # Init client first since user is trying to fetch the pipeline
             # details from the platform
@@ -187,9 +183,9 @@ class Pipeline(EmbedChain):
         Create a pipeline on the platform.
         """
         print("🛠️ Creating pipeline on the platform...")
-        # self.yaml_config is a dict. Pass it inside the key 'yaml_config' to the backend
+        # self.config_data is a dict. Pass it inside the key 'yaml_config' to the backend
         payload = {
-            "yaml_config": json.dumps(self.yaml_config),
+            "yaml_config": json.dumps(self.config_data),
             "name": self.name,
             "local_id": self.local_id,
         }
@@ -346,24 +342,57 @@ class Pipeline(EmbedChain):
         self.telemetry.capture(event_name="deploy", properties=self._telemetry_props)
 
     @classmethod
-    def from_config(cls, yaml_path: str, auto_deploy: bool = False):
+    def from_config(
+        cls,
+        config_path: Optional[str] = None,
+        config: Optional[Dict[str, Any]] = None,
+        auto_deploy: bool = False,
+        yaml_path: Optional[str] = None,
+    ):
         """
-        Instantiate a Pipeline object from a YAML configuration file.
+        Instantiate a Pipeline object from a configuration.
 
-        :param yaml_path: Path to the YAML configuration file.
-        :type yaml_path: str
+        :param config_path: Path to the YAML or JSON configuration file.
+        :type config_path: Optional[str]
+        :param config: A dictionary containing the configuration.
+        :type config: Optional[Dict[str, Any]]
         :param auto_deploy: Whether to deploy the pipeline automatically, defaults to False
         :type auto_deploy: bool, optional
+        :param yaml_path: (Deprecated) Path to the YAML configuration file. Use config_path instead.
+        :type yaml_path: Optional[str]
         :return: An instance of the Pipeline class.
         :rtype: Pipeline
         """
-        with open(yaml_path, "r") as file:
-            config_data = yaml.safe_load(file)
+        # Backward compatibility for yaml_path
+        if yaml_path and not config_path:
+            config_path = yaml_path
+
+        if config_path and config:
+            raise ValueError("Please provide only one of config_path or config.")
+
+        config_data = None
+
+        if config_path:
+            file_extension = os.path.splitext(config_path)[1]
+            with open(config_path, "r") as file:
+                if file_extension in [".yaml", ".yml"]:
+                    config_data = yaml.safe_load(file)
+                elif file_extension == ".json":
+                    config_data = json.load(file)
+                else:
+                    raise ValueError("config_path must be a path to a YAML or JSON file.")
+        elif config and isinstance(config, dict):
+            config_data = config
+        else:
+            logging.error(
+                "Please provide either a config file path (YAML or JSON) or a config dictionary. Falling back to defaults because no config is provided.",  # noqa: E501
+            )
+            config_data = {}
 
         try:
-            validate_yaml_config(config_data)
+            validate_config(config_data)
         except Exception as e:
-            raise Exception(f"❌ Error occurred while validating the YAML config. Error: {str(e)}")
+            raise Exception(f"Error occurred while validating the config. Error: {str(e)}")
 
         pipeline_config_data = config_data.get("app", {}).get("config", {})
         db_config_data = config_data.get("vectordb", {})
@@ -388,7 +417,7 @@ class Pipeline(EmbedChain):
         )
 
         # Send anonymous telemetry
-        event_properties = {"init_type": "yaml_config"}
+        event_properties = {"init_type": "config_data"}
         AnonymousTelemetry().capture(event_name="init", properties=event_properties)
 
         return cls(
@@ -396,7 +425,7 @@ class Pipeline(EmbedChain):
             llm=llm,
             db=db,
             embedding_model=embedding_model,
-            yaml_path=yaml_path,
+            config_data=config_data,
             auto_deploy=auto_deploy,
             chunker=chunker_config_data,
         )
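A quick usage sketch of the backward-compatibility path added above, assuming a `config.yaml` exists on disk: the deprecated `yaml_path` keyword is mapped onto `config_path`, and supplying both a file path and a dict raises a `ValueError`:

```python
from embedchain import Pipeline

# New-style call: YAML or JSON file, chosen by extension
app = Pipeline.from_config(config_path="config.yaml")

# Deprecated keyword still works; yaml_path is copied into config_path
app = Pipeline.from_config(yaml_path="config.yaml")

# Mixing a file path with an inline dict is rejected:
# Pipeline.from_config(config_path="config.yaml", config={...})  # ValueError
```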

+ 1 - 1
embedchain/store/assistants.py

@@ -165,7 +165,7 @@ class AIAssistant:
         self.instructions = instructions
         self.assistant_id = assistant_id or str(uuid.uuid4())
         self.thread_id = thread_id or str(uuid.uuid4())
-        self.pipeline = Pipeline.from_config(yaml_path=yaml_path) if yaml_path else Pipeline()
+        self.pipeline = Pipeline.from_config(config_path=yaml_path) if yaml_path else Pipeline()
         self.pipeline.local_id = self.pipeline.config.id = self.thread_id
 
         if self.instructions:

+ 1 - 1
embedchain/utils.py

@@ -355,7 +355,7 @@ def is_valid_json_string(source: str):
         return False
 
 
-def validate_yaml_config(config_data):
+def validate_config(config_data):
     schema = Schema(
         {
             Optional("app"): {

+ 6 - 6
examples/rest-api/main.py

@@ -108,7 +108,7 @@ async def get_datasources_associated_with_app_id(app_id: str, db: Session = Depe
         if db_app is None:
             raise HTTPException(detail=f"App with id {app_id} does not exist, please create it first.", status_code=400)
 
-        app = App.from_config(yaml_path=db_app.config)
+        app = App.from_config(config_path=db_app.config)
 
         response = app.get_data_sources()
         return {"results": response}
@@ -147,7 +147,7 @@ async def add_datasource_to_an_app(body: SourceApp, app_id: str, db: Session = D
         if db_app is None:
             raise HTTPException(detail=f"App with id {app_id} does not exist, please create it first.", status_code=400)
 
-        app = App.from_config(yaml_path=db_app.config)
+        app = App.from_config(config_path=db_app.config)
 
         response = app.add(source=body.source, data_type=body.data_type)
         return DefaultResponse(response=response)
@@ -185,7 +185,7 @@ async def query_an_app(body: QueryApp, app_id: str, db: Session = Depends(get_db
         if db_app is None:
             raise HTTPException(detail=f"App with id {app_id} does not exist, please create it first.", status_code=400)
 
-        app = App.from_config(yaml_path=db_app.config)
+        app = App.from_config(config_path=db_app.config)
 
         response = app.query(body.query)
         return DefaultResponse(response=response)
@@ -227,7 +227,7 @@ async def query_an_app(body: QueryApp, app_id: str, db: Session = Depends(get_db
 #               status_code=400
 #             )
 
-#         app = App.from_config(yaml_path=db_app.config)
+#         app = App.from_config(config_path=db_app.config)
 
 #         response = app.chat(body.message)
 #         return DefaultResponse(response=response)
@@ -264,7 +264,7 @@ async def deploy_app(body: DeployAppRequest, app_id: str, db: Session = Depends(
         if db_app is None:
             raise HTTPException(detail=f"App with id {app_id} does not exist, please create it first.", status_code=400)
 
-        app = App.from_config(yaml_path=db_app.config)
+        app = App.from_config(config_path=db_app.config)
 
         api_key = body.api_key
         # this will save the api key in the embedchain.db
@@ -305,7 +305,7 @@ async def delete_app(app_id: str, db: Session = Depends(get_db)):
         if db_app is None:
             raise HTTPException(detail=f"App with id {app_id} does not exist, please create it first.", status_code=400)
 
-        app = App.from_config(yaml_path=db_app.config)
+        app = App.from_config(config_path=db_app.config)
 
         # reset app.db
         app.db.reset()

+ 1 - 1
notebooks/anthropic.ipynb

@@ -109,7 +109,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"anthropic.yaml\")"
+        "app = App.from_config(config_path=\"anthropic.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/azure-openai.ipynb

@@ -105,7 +105,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "app = App.from_config(yaml_path=\"azure_openai.yaml\")"
+    "app = App.from_config(config_path=\"azure_openai.yaml\")"
    ]
   },
   {

+ 1 - 1
notebooks/chromadb.ipynb

@@ -105,7 +105,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"chromadb.yaml\")"
+        "app = App.from_config(config_path=\"chromadb.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/cohere.ipynb

@@ -114,7 +114,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"cohere.yaml\")"
+        "app = App.from_config(config_path=\"cohere.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/elasticsearch.ipynb

@@ -103,7 +103,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"elasticsearch.yaml\")"
+        "app = App.from_config(config_path=\"elasticsearch.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/gpt4all.ipynb

@@ -114,7 +114,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"gpt4all.yaml\")"
+        "app = App.from_config(config_path=\"gpt4all.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/hugging_face_hub.ipynb

@@ -114,7 +114,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"huggingface.yaml\")"
+        "app = App.from_config(config_path=\"huggingface.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/jina.ipynb

@@ -114,7 +114,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"jina.yaml\")"
+        "app = App.from_config(config_path=\"jina.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/llama2.ipynb

@@ -109,7 +109,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"llama2.yaml\")"
+        "app = App.from_config(config_path=\"llama2.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/openai.ipynb

@@ -115,7 +115,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"openai.yaml\")"
+        "app = App.from_config(config_path=\"openai.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/opensearch.ipynb

@@ -107,7 +107,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"opensearch.yaml\")"
+        "app = App.from_config(config_path=\"opensearch.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/pinecone.ipynb

@@ -104,7 +104,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"pinecone.yaml\")"
+        "app = App.from_config(config_path=\"pinecone.yaml\")"
       ]
     },
     {

+ 1 - 1
notebooks/vertex_ai.ipynb

@@ -117,7 +117,7 @@
       },
       "outputs": [],
       "source": [
-        "app = App.from_config(yaml_path=\"vertexai.yaml\")"
+        "app = App.from_config(config_path=\"vertexai.yaml\")"
       ]
     },
     {

+ 1 - 1
pyproject.toml

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "embedchain"
-version = "0.1.22"
+version = "0.1.23"
 description = "Data platform for LLMs - Load, index, retrieve and sync any unstructured data"
 authors = [
     "Taranjeet Singh <taranjeet@embedchain.ai>",

+ 2 - 2
tests/test_utils.py

@@ -1,6 +1,6 @@
 import yaml
 
-from embedchain.utils import validate_yaml_config
+from embedchain.utils import validate_config
 
 CONFIG_YAMLS = [
     "configs/anthropic.yaml",
@@ -30,7 +30,7 @@ def test_all_config_yamls():
         assert config is not None
 
         try:
-            validate_yaml_config(config)
+            validate_config(config)
         except Exception as e:
             print(f"Error in {config_yaml}: {e}")
             raise e