---
title: 🔬 Evaluation
---
## Overview
We provide out-of-the-box evaluation metrics for your RAG application. You can use them to evaluate your RAG application and compare results across different settings of your production RAG application.
Currently, we support the following evaluation metrics:
<CardGroup cols={3}>
<Card title="Context Relevancy" href="#context_relevancy"></Card>
<Card title="Answer Relevancy" href="#answer_relevancy"></Card>
<Card title="Groundedness" href="#groundedness"></Card>
<Card title="Custom Metric" href="#custom_metric"></Card>
</CardGroup>
## Quickstart
Here is a basic example of running evaluation:
```python example.py
from embedchain import App

app = App()

# Add data sources
app.add("https://www.forbes.com/profile/elon-musk")

# Run evaluation
app.evaluate(["What is the net worth of Elon Musk?", "How many companies does Elon Musk own?"])
# {'answer_relevancy': 0.9987286412340826, 'groundedness': 1.0, 'context_relevancy': 0.3571428571428571}
```
Under the hood, Embedchain does the following:
1. Runs semantic search in the vector database and fetches the context
2. Makes an LLM call with the question and context to fetch the answer
3. Runs evaluation on the following metrics: `context relevancy`, `groundedness`, and `answer relevancy`, and returns the result
## Advanced Usage
We use OpenAI's `gpt-4` model as the default LLM for automatic evaluation, so you must set `OPENAI_API_KEY` as an environment variable.
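For example, you can set the key from Python before running the evaluation (a minimal sketch; exporting the variable in your shell works just as well):
```python
import os

# Key used by the evaluation LLM (gpt-4 by default). Replace the placeholder
# with your own key, or export OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-xxx"
```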
### Step-1: Create dataset
In order to evaluate your RAG application, you have to set up a dataset. Each data point in the dataset consists of a `question`, `contexts`, and an `answer`. Here is an example of how to create a dataset for evaluation:
```python
from embedchain.utils.eval import EvalData

data = [
    {
        "question": "What is the net worth of Elon Musk?",
        "contexts": [
            "Elon Musk PROFILEElon MuskCEO, ...",
            "a Twitter poll on whether the journalists' ...",
            "2016 and run by Jared Birchall.[335]...",
        ],
        "answer": "As of the information provided, Elon Musk's net worth is $241.6 billion.",
    },
    {
        "question": "which companies does Elon Musk own?",
        "contexts": [
            "of December 2023[update], ...",
            "ThielCofounderView ProfileTeslaHolds ...",
            "Elon Musk PROFILEElon MuskCEO, ...",
        ],
        "answer": "Elon Musk owns several companies, including Tesla, SpaceX, Neuralink, and The Boring Company.",
    },
]

dataset = []
for d in data:
    eval_data = EvalData(question=d["question"], contexts=d["contexts"], answer=d["answer"])
    dataset.append(eval_data)
```
### Step-2: Run evaluation
Once you have created your dataset, you can run the evaluation by picking the metric you want to evaluate on.
For example, you can evaluate the context relevancy metric using the following code:
```python
from embedchain.evaluation.metrics import ContextRelevance

metric = ContextRelevance()
score = metric.evaluate(dataset)
print(score)
```
You can choose a different metric or write your own to run evaluation on; a combined sketch follows the list below. You can check the following links:
- [Context Relevancy](#context_relevancy)
- [Answer Relevancy](#answer_relevancy)
- [Groundedness](#groundedness)
- [Build your own metric](#custom_metric)
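As a quick illustration of combining metrics, the following sketch runs all three built-in metrics over the same dataset. It relies only on the `evaluate(dataset)` interface shown in this guide, and assumes each metric instance exposes the `name` it was initialized with (as described in the custom metric section):
```python
from embedchain.evaluation.metrics import AnswerRelevance, ContextRelevance, Groundedness

# Run every built-in metric on the same dataset and collect the scores by name.
metrics = [ContextRelevance(), AnswerRelevance(), Groundedness()]
scores = {metric.name: metric.evaluate(dataset) for metric in metrics}
print(scores)  # each score is a float between 0 and 1, keyed by metric name
```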
## Metrics
### Context Relevancy <a id="context_relevancy"></a>
Context relevancy is a metric to determine "how relevant the context is to the question". We use OpenAI's `gpt-4` model to determine the relevancy of the context. We achieve this by prompting the model with the question and the context and asking it to return relevant sentences from the context. We then use the following formula to determine the score:
```
context_relevance_score = num_relevant_sentences_in_context / num_of_sentences_in_context
```
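As a worked illustration of the formula (not the library's internal code):
```python
def context_relevance(num_relevant_sentences: int, num_sentences_in_context: int) -> float:
    """Fraction of context sentences that the LLM judged relevant to the question."""
    return num_relevant_sentences / num_sentences_in_context

# e.g. 5 relevant sentences out of 14 gives ~0.357, matching the quickstart score above.
print(context_relevance(5, 14))
```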
#### Examples
You can run the context relevancy evaluation with the following simple code:
```python
from embedchain.evaluation.metrics import ContextRelevance

metric = ContextRelevance()
score = metric.evaluate(dataset)  # 'dataset' is defined in the create dataset section
print(score)
# 0.27975528364849833
```
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `ContextRelevanceConfig` class.
Here is a more advanced example of how to pass a custom evaluation config for evaluating the context relevance metric:
```python
from embedchain.config.evaluation.base import ContextRelevanceConfig
from embedchain.evaluation.metrics import ContextRelevance

eval_config = ContextRelevanceConfig(model="gpt-4", api_key="sk-xxx", language="en")
metric = ContextRelevance(config=eval_config)
metric.evaluate(dataset)
```
#### `ContextRelevanceConfig`
<ParamField path="model" type="str" optional>
The model to use for the evaluation. Defaults to `gpt-4`. We only support OpenAI's models for now.
</ParamField>
<ParamField path="api_key" type="str" optional>
The OpenAI API key to use for the evaluation. Defaults to `None`. If not provided, we will use the `OPENAI_API_KEY` environment variable.
</ParamField>
<ParamField path="language" type="str" optional>
The language of the dataset being evaluated. We need this to understand the context provided in the dataset. Defaults to `en`.
</ParamField>
<ParamField path="prompt" type="str" optional>
The prompt to extract the relevant sentences from the context. Defaults to `CONTEXT_RELEVANCY_PROMPT`, which can be found in `embedchain.config.evaluation.base`.
</ParamField>
### Answer Relevancy <a id="answer_relevancy"></a>
Answer relevancy is a metric to determine how relevant the answer is to the question. We prompt the model with the answer and ask it to generate questions from the answer. We then use the cosine similarity between the generated questions and the original question to determine the score:
```
answer_relevancy_score = mean(cosine_similarity(generated_questions, original_question))
```
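Here is a minimal numpy sketch of that formula, assuming you already have embeddings for the generated questions and the original question (it is not the library's internal code):
```python
import numpy as np

def answer_relevancy(generated_question_embeddings: np.ndarray, original_question_embedding: np.ndarray) -> float:
    """Mean cosine similarity between each generated question and the original question."""
    generated = generated_question_embeddings / np.linalg.norm(generated_question_embeddings, axis=1, keepdims=True)
    original = original_question_embedding / np.linalg.norm(original_question_embedding)
    return float(np.mean(generated @ original))
```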
#### Examples
You can run the answer relevancy evaluation with the following simple code:
```python
from embedchain.evaluation.metrics import AnswerRelevance

metric = AnswerRelevance()
score = metric.evaluate(dataset)
print(score)
# 0.9505334177461916
```
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `AnswerRelevanceConfig` class. Here is a more advanced example where you can provide your own evaluation config:
```python
from embedchain.config.evaluation.base import AnswerRelevanceConfig
from embedchain.evaluation.metrics import AnswerRelevance

eval_config = AnswerRelevanceConfig(
    model="gpt-4",
    embedder="text-embedding-ada-002",
    api_key="sk-xxx",
    num_gen_questions=2,
)
metric = AnswerRelevance(config=eval_config)
score = metric.evaluate(dataset)
```
#### `AnswerRelevanceConfig`
<ParamField path="model" type="str" optional>
The model to use for the evaluation. Defaults to `gpt-4`. We only support OpenAI's models for now.
</ParamField>
<ParamField path="embedder" type="str" optional>
The embedder to use for embedding the text. Defaults to `text-embedding-ada-002`. We only support OpenAI's embedders for now.
</ParamField>
<ParamField path="api_key" type="str" optional>
The OpenAI API key to use for the evaluation. Defaults to `None`. If not provided, we will use the `OPENAI_API_KEY` environment variable.
</ParamField>
<ParamField path="num_gen_questions" type="int" optional>
The number of questions to generate for each answer. We compare the generated questions with the original question to determine the score. Defaults to `1`.
</ParamField>
<ParamField path="prompt" type="str" optional>
The prompt to extract the `num_gen_questions` number of questions from the provided answer. Defaults to `ANSWER_RELEVANCY_PROMPT`, which can be found in `embedchain.config.evaluation.base`.
</ParamField>
### Groundedness <a id="groundedness"></a>
Groundedness is a metric to determine how grounded the answer is in the context. We use OpenAI's `gpt-4` model to determine the groundedness of the answer. We achieve this by prompting the model with the answer and asking it to generate claims from the answer. We then prompt the model again with the context and the generated claims to determine the verdict on each claim. We then use the following formula to determine the score:
```
groundedness_score = (sum of all verdicts) / (total # of claims)
```
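As a worked illustration, each claim extracted from the answer gets a binary verdict (1 if the context supports it, 0 otherwise), and the score is simply their average:
```python
# Illustration of the formula above (not the library's internal implementation).
verdicts = [1, 1, 1, 0]  # verdicts for four claims extracted from an answer

groundedness_score = sum(verdicts) / len(verdicts)
print(groundedness_score)  # 0.75
```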
You can run the groundedness evaluation with the following simple code:
```python
from embedchain.evaluation.metrics import Groundedness

metric = Groundedness()
score = metric.evaluate(dataset)  # dataset from above
print(score)
# 1.0
```
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `GroundednessConfig` class. Here is a more advanced example where you can provide your own evaluation config:
```python
from embedchain.config.evaluation.base import GroundednessConfig
from embedchain.evaluation.metrics import Groundedness

eval_config = GroundednessConfig(model="gpt-4", api_key="sk-xxx")
metric = Groundedness(config=eval_config)
score = metric.evaluate(dataset)
```
#### `GroundednessConfig`
<ParamField path="model" type="str" optional>
The model to use for the evaluation. Defaults to `gpt-4`. We only support OpenAI's models for now.
</ParamField>
<ParamField path="api_key" type="str" optional>
The OpenAI API key to use for the evaluation. Defaults to `None`. If not provided, we will use the `OPENAI_API_KEY` environment variable.
</ParamField>
<ParamField path="answer_claims_prompt" type="str" optional>
The prompt to extract the claims from the provided answer. Defaults to `GROUNDEDNESS_ANSWER_CLAIMS_PROMPT`, which can be found in `embedchain.config.evaluation.base`.
</ParamField>
<ParamField path="claims_inference_prompt" type="str" optional>
The prompt to get verdicts on the claims extracted from the answer, given the context. Defaults to `GROUNDEDNESS_CLAIMS_INFERENCE_PROMPT`, which can be found in `embedchain.config.evaluation.base`.
</ParamField>
### Custom <a id="custom_metric"></a>
You can also create your own evaluation metric by extending the `BaseMetric` class. You can find the source code for the existing metrics in `embedchain.evaluation.metrics`.
<Note>
You must provide the `name` of your custom metric in the `__init__` method of your class. This name will be used to identify your metric in the evaluation report.
</Note>
```python
from typing import Optional

from embedchain.config.base_config import BaseConfig
from embedchain.evaluation.metrics import BaseMetric
from embedchain.utils.eval import EvalData


class MyCustomMetric(BaseMetric):
    def __init__(self, config: Optional[BaseConfig] = None):
        super().__init__(name="my_custom_metric")

    def evaluate(self, dataset: list[EvalData]):
        score = 0.0
        # write your evaluation logic here
        return score
```
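Once defined, you can run it the same way as the built-in metrics, using the `dataset` created in Step-1:
```python
metric = MyCustomMetric()
score = metric.evaluate(dataset)
print(score)
```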