---
title: '📋 Supported data formats'
---
Embedchain supports the following data formats:
### YouTube video
To add a YouTube video to your app, pass `youtube_video` as the data_type (the first argument to the `.add()` method). For example:
```python
app.add('youtube_video', 'a_valid_youtube_url_here')
```
### PDF file
To add a PDF file, use `pdf_file` as the data_type. For example:
```python
app.add('pdf_file', 'a_valid_url_where_pdf_file_can_be_accessed')
```
Note that password-protected PDFs are not supported.
### Web page
To add a web page, use `web_page` as the data_type. For example:
```python
app.add('web_page', 'a_valid_web_page_url')
```
### Sitemap
To add all web pages listed in an XML sitemap, use `sitemap` as the data_type. Non-text files are filtered out. For example:
```python
app.add('sitemap', 'https://example.com/sitemap.xml')
```
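To picture what "filtering non-text files" can mean, here is a minimal standard-library sketch. It is not Embedchain's actual loader: the function name `text_page_urls` and the list of extensions are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical list of extensions treated as non-text; adjust to your needs.
NON_TEXT_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg", ".gif", ".zip", ".mp4")


def text_page_urls(sitemap_xml):
    """Return the <loc> URLs of a sitemap, skipping likely non-text files."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [u for u in urls if not u.lower().endswith(NON_TEXT_EXTENSIONS)]


example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/guide.html</loc></url>
  <url><loc>https://example.com/logo.png</loc></url>
</urlset>"""

print(text_page_urls(example))
```

The sketch only looks at the URL suffix, so it cannot detect non-text content served from extensionless URLs.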
### Doc file
To add a doc/docx file, use `docx` as the data_type. For example:
```python
app.add('docx', 'a_local_docx_file_path')
```
### Code documentation website loader
To add a code documentation website, use `docs_site` as the data_type. For example:
```python
app.add("docs_site", "https://docs.embedchain.ai/")
```
### Notion
To use Notion, you must first install the extra dependencies with `pip install embedchain[notion]`.
To load a Notion page, use `notion` as the data_type.
The next argument must **end** with the Notion page id, which is a 32-character string. For example:
```python
app.add("notion", "cfbc134ca6464fc980d0391613959196")
app.add("notion", "my-page-cfbc134ca6464fc980d0391613959196")
app.add("notion", "https://www.notion.so/my-page-cfbc134ca6464fc980d0391613959196")
```
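If you want to check a value before passing it in, the "must end with the page id" rule can be sketched as a small regular expression. This is an illustration, not Embedchain's own parsing: the helper name `extract_notion_page_id` is hypothetical, and the sketch assumes the id consists of hexadecimal characters, as in the examples above.

```python
import re

# Assumption: the page id is 32 hexadecimal characters at the very end.
_PAGE_ID_AT_END = re.compile(r"([0-9a-f]{32})$", re.IGNORECASE)


def extract_notion_page_id(source):
    """Return the trailing 32-character page id, or None if there is none."""
    match = _PAGE_ID_AT_END.search(source.strip())
    return match.group(1) if match else None


for candidate in (
    "cfbc134ca6464fc980d0391613959196",
    "https://www.notion.so/my-page-cfbc134ca6464fc980d0391613959196",
    "https://www.notion.so/my-page",
):
    print(candidate, "->", extract_notion_page_id(candidate))
```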
### Text
To supply your own text, use `text` as the data_type and pass a string. The text is not processed, which makes this format very versatile. For example:
```python
app.add_local('text', 'Seek wealth, not money or status. Wealth is having assets that earn while you sleep. Money is how we transfer time and wealth. Status is your place in the social hierarchy.')
```
Note: The examples do not use this format because in most cases you would supply a whole paragraph or file, which would not fit.
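For the "whole file" case, one possible approach is to read the file yourself and pass its contents as `text`. The helper below is a sketch: `add_text_file` is a hypothetical name, and the file is assumed to be UTF-8 encoded. `app` is any object with the `add_local` method shown above.

```python
from pathlib import Path


def add_text_file(app, path):
    """Read a local UTF-8 text file and add its contents as `text`.

    Returns the number of characters that were added.
    """
    text = Path(path).read_text(encoding="utf-8")
    app.add_local("text", text)
    return len(text)
```

With an existing app, a call would look like `add_text_file(app, "notes.txt")`, where `notes.txt` is a placeholder path.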
### QnA pair
To supply your own QnA pair, use `qna_pair` as the data_type and pass a tuple. For example:
```python
app.add_local('qna_pair', ("Question", "Answer"))
```
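When you have several pairs, a simple loop is enough. The sketch below assumes only the `add_local('qna_pair', ...)` call shown above. The helper name `add_qna_pairs` and the choice to skip incomplete pairs are illustrative, not part of Embedchain.

```python
def add_qna_pairs(app, pairs):
    """Add each (question, answer) tuple with the `qna_pair` data type.

    Returns the number of pairs that were added.
    """
    added = 0
    for question, answer in pairs:
        # Skip incomplete pairs rather than indexing empty strings.
        if not question or not answer:
            continue
        app.add_local("qna_pair", (question, answer))
        added += 1
    return added
```

A call would look like `add_qna_pairs(app, [("Question 1", "Answer 1"), ("Question 2", "Answer 2")])`.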
## Reusing a vector database
By default, Embedchain creates a persistent vector DB in the directory **./db**. You can therefore split your application into two Python scripts: one that creates the local vector DB and another that reuses it. This is useful when you want to index hundreds of documents and implement the chat interface separately.
Create a local index:
```python
from embedchain import App
naval_chat_bot = App()
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
```
You can reuse the local index with the same code, but without adding new documents:
```python
from embedchain import App
naval_chat_bot = App()
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
```
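If you prefer a single script that indexes only once, you could first check whether the default directory already holds data. The sketch below rests on an assumption: it treats a non-empty **./db** directory as a sign that an index was built earlier, which says nothing about what that index contains. The helper name `index_exists` is hypothetical.

```python
from pathlib import Path


def index_exists(db_dir="./db"):
    """Return True if `db_dir` is a directory with at least one entry."""
    path = Path(db_dir)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if index_exists():
        print("Found existing data in ./db; skipping indexing.")
    else:
        print("No data found in ./db; run the indexing script first.")
```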
### More formats (coming soon!)
- If you want to add any other format, please create an [issue](https://github.com/embedchain/embedchain/issues) and we will add it to the list of supported formats.