---
title: '📋 Supported data formats'
---
## Automatic data type detection
The `add` method automatically tries to detect the `data_type` from the value you pass as the `source` argument. So `app.add('https://www.youtube.com/watch?v=dQw4w9WgXcQ')` is enough to embed a YouTube video.
This detection is implemented for all formats. It is based on factors such as whether the source is a URL or a local file, the type of the source data, etc.
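To make the idea concrete, here is a minimal sketch of this kind of heuristic. It is not embedchain's actual detection logic: the function name `guess_data_type` and its rules are assumptions chosen for illustration, and it only returns data type keywords that appear on this page.

```python
import os
from urllib.parse import urlparse


def guess_data_type(source):
    """Hypothetical heuristic, NOT embedchain's real detection logic."""
    # A tuple is treated as a question/answer pair.
    if isinstance(source, tuple):
        return "qna_pair"

    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        path = parsed.path.lower()
        if "youtube.com" in host or "youtu.be" in host:
            return "youtube_video"
        if path.endswith(".pdf"):
            return "pdf_file"
        if path.endswith(".xml"):
            return "sitemap"
        if path.endswith((".doc", ".docx")):
            return "docx"
        return "web_page"

    # Not a URL: only an existing local doc/docx file is treated as a file.
    if source.lower().endswith((".doc", ".docx")) and os.path.isfile(source):
        return "docx"

    # Everything else, including a mistyped file path, falls back to raw text.
    return "text"
```

Note how the last branch reproduces the pitfall of any such heuristic: a path that does not exist is indistinguishable from ordinary text.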
### Debugging automatic detection
Set `log_level=DEBUG` (in [AppConfig](http://localhost:3000/advanced/query_configuration#appconfig)) and check that each source is detected as intended.
Otherwise, you will not notice when, for instance, an invalid file path is interpreted as raw text instead.
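One way to guard against that silent fallback is to check that a local path exists before passing it on. The helper below is a sketch using only the standard library; `existing_file` is a hypothetical name and is not part of embedchain.

```python
from pathlib import Path


def existing_file(source):
    """Return the path as a string if it points to an existing file.

    Hypothetical helper, not part of embedchain. Raises FileNotFoundError
    so that a typo cannot be silently treated as raw text.
    """
    path = Path(source).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {source!r}")
    return str(path)
```

With this in place, a call such as `app.add(existing_file('content/intro.docx'), data_type='docx')` fails loudly on a mistyped path.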
### Forcing a data type
To avoid any issues with data type detection, you can **force** a `data_type` by passing it as an argument to the `add` method.
The examples below show the keyword that forces each `data_type`.
Forcing can also be used for edge cases, such as interpreting a sitemap as a `web_page` to read its raw text instead of following its links.
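The sitemap edge case could look like the sketch below. The helper name `add_sitemap_as_raw_text` is hypothetical, and `app` is assumed to be any object exposing the `add(source, data_type=...)` method described on this page.

```python
def add_sitemap_as_raw_text(app, sitemap_url):
    """Force `web_page` so the sitemap's own text is read, not its links.

    Hypothetical helper; `app` is assumed to expose add(source, data_type=...).
    """
    # Without the forced data_type, a sitemap URL may be detected as `sitemap`.
    app.add(sitemap_url, data_type="web_page")
```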
## Remote Data Types
<Tip>
**Use local files in remote data types**
Some data types are meant for remote content and only work with URLs.
You can pass local files by formatting the path using the `file:` [URI scheme](https://en.wikipedia.org/wiki/File_URI_scheme), e.g. `file:///info.pdf`.
</Tip>
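Building such a URI by hand is error-prone, so here is a small sketch that uses Python's standard `pathlib` to do it. The helper name `to_file_uri` is an assumption for illustration and is not part of embedchain.

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a local path to an absolute `file:` URI."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()
```

For instance, `app.add(to_file_uri('info.pdf'), data_type='pdf_file')` would pass a PDF from the current working directory to a data type that expects a URL.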
### YouTube video
To add any YouTube video to your app, set `data_type` (the second argument to the `.add()` method) to `youtube_video`. Eg:
```python
app.add('a_valid_youtube_url_here', data_type='youtube_video')
```
### PDF file
To add any PDF file, set `data_type` to `pdf_file`. Eg:
```python
app.add('a_valid_url_where_pdf_file_can_be_accessed', data_type='pdf_file')
```
Note that password-protected PDFs are not supported.
### Web page
To add any web page, set `data_type` to `web_page`. Eg:
```python
app.add('a_valid_web_page_url', data_type='web_page')
```
### Sitemap
Add all web pages listed in an XML sitemap. Non-text files are filtered out. Set `data_type` to `sitemap`. Eg:
```python
app.add('https://example.com/sitemap.xml', data_type='sitemap')
```
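To illustrate what "all web pages, minus non-text files" can mean in practice, the sketch below extracts the `<loc>` entries from a sitemap and skips URLs with a media or archive extension. It is not embedchain's loader: the function name `sitemap_page_urls` and the extension list are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed list of extensions to skip; the real filter may differ.
NON_TEXT_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4", ".zip")


def sitemap_page_urls(sitemap_xml):
    """Return the <loc> URLs of a sitemap, skipping non-text files."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Tags carry the sitemap XML namespace, e.g. "{http://...}loc".
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            url = element.text.strip()
            if not url.lower().endswith(NON_TEXT_EXTENSIONS):
                urls.append(url)
    return urls


example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/logo.png</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# The image URL is dropped; the two pages remain.
print(sitemap_page_urls(example))
```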
### Doc file
To add any doc/docx file, set `data_type` to `docx`. `docx` accepts both remote URLs and conventional file paths. Eg:
```python
app.add('https://example.com/content/intro.docx', data_type="docx")
app.add('content/intro.docx', data_type="docx")
```
### Code documentation website loader
To add any code documentation website, set `data_type` to `docs_site`. Eg:
```python
app.add("https://docs.embedchain.ai/", data_type="docs_site")
```
### Notion
To use Notion, you must install the extra dependencies with `pip install embedchain[notion]`.
To load a Notion page, set `data_type` to `notion`. Since Notion sources are hard to detect automatically, forcing this data type is advised.
The source argument must **end** with the Notion page ID, which is a 32-character string. Eg:
```python
app.add("cfbc134ca6464fc980d0391613959196", "notion")
app.add("my-page-cfbc134ca6464fc980d0391613959196", "notion")
app.add("https://www.notion.so/my-page-cfbc134ca6464fc980d0391613959196", "notion")
```
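All three forms work because only the trailing ID matters. As a sketch of that rule, the helper below pulls the ID off the end of a source string. The name `notion_page_id` is hypothetical, and the pattern assumes the ID consists of lowercase hexadecimal characters, as in the examples above.

```python
import re

# Assumption: the ID is 32 lowercase hexadecimal characters at the very end.
_PAGE_ID = re.compile(r"[0-9a-f]{32}$")


def notion_page_id(source):
    """Return the trailing 32-character page ID, or raise ValueError."""
    match = _PAGE_ID.search(source)
    if match is None:
        raise ValueError(f"No Notion page ID at the end of {source!r}")
    return match.group(0)
```

Running a check like this before calling `app.add(..., "notion")` catches URLs that carry a trailing query string or fragment after the ID.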
## Local Data Types
### Text
To supply your own text, set `data_type` to `text` and pass a string. The text is not processed, which makes this type very versatile. Eg:
```python
app.add('Seek wealth, not money or status. Wealth is having assets that earn while you sleep. Money is how we transfer time and wealth. Status is your place in the social hierarchy.', data_type='text')
```
Note: This type is not used in the examples because in most cases you will supply a whole paragraph or file, which would not fit.
### QnA pair
To supply your own QnA pair, set `data_type` to `qna_pair` and pass a tuple. Eg:
```python
app.add(("Question", "Answer"), data_type="qna_pair")
```
## Reusing a vector database
The default behavior is to create a persistent vector DB in the directory **./db**. You can split your application into two Python scripts: one that creates the local vector DB and another that reuses it. This is useful when you want to index hundreds of documents and implement a chat interface separately.
Create a local index:
```python
from embedchain import App

# Adding sources stores their embeddings in the persistent DB in ./db
naval_chat_bot = App()
naval_chat_bot.add("https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
```
You can reuse the local index with the same code, but without adding new documents:
```python
from embedchain import App

# No add() calls: the app reuses the index already stored in ./db
naval_chat_bot = App()
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
```
## More formats (coming soon!)
- If you want support for any other format, please create an [issue](https://github.com/embedchain/embedchain/issues) and we will add it to the list of supported formats.