---
title: 🔬 Evaluation
---

## Overview

We provide out-of-the-box evaluation methods for your datasets. You can use them to evaluate your models and compare them with other models.

Currently, we provide the following evaluation methods:

<CardGroup cols={3}>
  <Card title="Context Relevancy" href="#context_relevancy"></Card>
  <Card title="Answer Relevancy" href="#answer_relevancy"></Card>
  <Card title="Groundedness" href="#groundedness"></Card>
  <Card title="Custom" href="#custom"></Card>
</CardGroup>

More evaluation metrics are coming soon! 🏗️

## Usage

We have found that the best way to evaluate datasets is with the help of OpenAI's `gpt-4` model. Hence, we require you to set `OPENAI_API_KEY` as an environment variable. If you don't want to set it, you can pass it in the `config` argument of the respective evaluation class, as shown in the examples below.
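For example, you can set the key from Python before constructing any metric (the key value below is just a placeholder):

```python
import os

# Placeholder key; replace with your own OpenAI API key.
os.environ["OPENAI_API_KEY"] = "sk-xxx"
```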
<Accordion title="We will assume the following dataset for the examples below">
<CodeGroup>
```python main.py
from embedchain.utils.eval import EvalData

data = [
    {
        "question": "What is the net worth of Elon Musk?",
        "contexts": [
            """Elon Musk PROFILEElon MuskCEO, ...""",
            """a Twitter poll on whether the journalists' ...""",
            """2016 and run by Jared Birchall.[335]...""",
        ],
        "answer": "As of the information provided, Elon Musk's net worth is $241.6 billion.",
    },
    {
        "question": "which companies does Elon Musk own?",
        "contexts": [
            """of December 2023[update], ...""",
            """ThielCofounderView ProfileTeslaHolds ...""",
            """Elon Musk PROFILEElon MuskCEO, ...""",
        ],
        "answer": "Elon Musk owns several companies, including Tesla, SpaceX, Neuralink, and The Boring Company.",
    },
]

dataset = []
for d in data:
    dataset.append(EvalData(question=d["question"], contexts=d["contexts"], answer=d["answer"]))
```
</CodeGroup>
</Accordion>
## Context Relevancy <a id="context_relevancy"></a>

Context relevancy is a metric to determine how relevant the context is to the question. We use OpenAI's `gpt-4` model to determine the relevancy of the context.

We achieve this by prompting the model with the question and the context and asking it to return the relevant sentences from the context. We then use the following formula to determine the score:

$$\text{context\_relevance\_score} = \frac{\text{number of relevant sentences in the context}}{\text{total number of sentences in the context}}$$
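As a minimal sketch of what this formula does for a single example (the sentence splitting and "relevant sentences" below stand in for what the `gpt-4` prompt returns internally; they are illustrative, not part of the embedchain API):

```python
# Toy context and the sentences the model judged relevant to the question.
context = "Elon Musk is the CEO of Tesla. He also founded SpaceX. He was born in 1971."
relevant_sentences = ["Elon Musk is the CEO of Tesla.", "He also founded SpaceX."]

total_sentences = [s.strip() for s in context.split(".") if s.strip()]
context_relevance_score = len(relevant_sentences) / len(total_sentences)
print(context_relevance_score)  # 2 / 3 ≈ 0.67
```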
You can run the context relevancy evaluation with the following simple code:

```python
from embedchain.eval.metrics import ContextRelevance

metric = ContextRelevance()
score = metric.evaluate(dataset)  # dataset from above
print(score)
# 0.27975528364849833
```

In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `ContextRelevanceConfig` class.

### ContextRelevanceConfig

<ParamField path="model" type="str" optional>
The model to use for the evaluation. Defaults to `gpt-4`. We only support OpenAI's models for now.
</ParamField>

<ParamField path="api_key" type="str" optional>
The OpenAI API key to use for the evaluation. Defaults to `None`. If not provided, we will use the `OPENAI_API_KEY` environment variable.
</ParamField>

<ParamField path="language" type="str" optional>
The language of the dataset being evaluated. We need this to understand the context provided in the dataset. Defaults to `en`.
</ParamField>

<ParamField path="prompt" type="str" optional>
The prompt to extract the relevant sentences from the context. Defaults to `CONTEXT_RELEVANCY_PROMPT`, which can be found at the `embedchain.config.eval.base` path.
</ParamField>

```python
# Import path assumed; see the prompt location noted above.
from embedchain.config.eval.base import ContextRelevanceConfig

openai_api_key = "sk-xxx"

metric = ContextRelevance(config=ContextRelevanceConfig(model="gpt-4", api_key=openai_api_key, language="en"))
print(metric.evaluate(dataset))
```
## Answer Relevancy <a id="answer_relevancy"></a>

Answer relevancy is a metric to determine how relevant the answer is to the question. We use OpenAI's `gpt-4` model to determine the relevancy of the answer.

We achieve this by prompting the model with the answer and asking it to generate questions from the answer. We then use the cosine similarity between the generated questions and the original question to determine the score:

$$\text{answer\_relevancy\_score} = \text{mean}\big(\text{cosine\_similarity}(\text{generated questions}, \text{original question})\big)$$
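Concretely, each generated question and the original question are embedded and then compared. A minimal sketch of the scoring step, assuming the embedding vectors have already been computed (the toy vectors below stand in for real embedding-model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for the output of an embedder
# such as text-embedding-ada-002.
original_question = np.array([0.1, 0.9, 0.3])
generated_questions = [np.array([0.12, 0.85, 0.31]), np.array([0.2, 0.7, 0.4])]

answer_relevancy_score = np.mean(
    [cosine_similarity(q, original_question) for q in generated_questions]
)
print(answer_relevancy_score)
```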
You can run the answer relevancy evaluation with the following simple code:

```python
from embedchain.eval.metrics import AnswerRelevance

metric = AnswerRelevance()
score = metric.evaluate(dataset)  # dataset from above
print(score)
# 0.9505334177461916
```

In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `AnswerRelevanceConfig` class.

### AnswerRelevanceConfig

<ParamField path="model" type="str" optional>
The model to use for the evaluation. Defaults to `gpt-4`. We only support OpenAI's models for now.
</ParamField>

<ParamField path="embedder" type="str" optional>
The embedder to use for embedding the text. Defaults to `text-embedding-ada-002`. We only support OpenAI's embedders for now.
</ParamField>

<ParamField path="api_key" type="str" optional>
The OpenAI API key to use for the evaluation. Defaults to `None`. If not provided, we will use the `OPENAI_API_KEY` environment variable.
</ParamField>

<ParamField path="num_gen_questions" type="int" optional>
The number of questions to generate for each answer. We compare the generated questions with the original question to determine the score. Defaults to `1`.
</ParamField>

<ParamField path="prompt" type="str" optional>
The prompt to extract the `num_gen_questions` number of questions from the provided answer. Defaults to `ANSWER_RELEVANCY_PROMPT`, which can be found at the `embedchain.config.eval.base` path.
</ParamField>

```python
# Import path assumed; see the prompt location noted above.
from embedchain.config.eval.base import AnswerRelevanceConfig

openai_api_key = "sk-xxx"

metric = AnswerRelevance(
    config=AnswerRelevanceConfig(
        model="gpt-4",
        embedder="text-embedding-ada-002",
        api_key=openai_api_key,
        num_gen_questions=2,
    )
)
print(metric.evaluate(dataset))
```
## Groundedness <a id="groundedness"></a>

Groundedness is a metric to determine how grounded the answer is in the context. We use OpenAI's `gpt-4` model to determine the groundedness of the answer.

We achieve this by prompting the model with the answer and asking it to generate claims from the answer. We then prompt the model again with the context and the generated claims to obtain a verdict on each claim. We then use the following formula to determine the score:

$$\text{groundedness\_score} = \frac{\text{sum of all verdicts}}{\text{total number of claims}}$$
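As a minimal sketch of that final step, assuming the claims-inference prompt has already returned one binary verdict per claim (the verdict list below is hypothetical):

```python
# Hypothetical verdicts: 1 = claim supported by the context, 0 = not supported.
verdicts = [1, 1, 0, 1]

groundedness_score = sum(verdicts) / len(verdicts)
print(groundedness_score)  # 0.75
```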
You can run the groundedness evaluation with the following simple code:

```python
from embedchain.eval.metrics import Groundedness

metric = Groundedness()
score = metric.evaluate(dataset)  # dataset from above
print(score)
# 1.0
```

In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `GroundednessConfig` class.

### GroundednessConfig

<ParamField path="model" type="str" optional>
The model to use for the evaluation. Defaults to `gpt-4`. We only support OpenAI's models for now.
</ParamField>

<ParamField path="api_key" type="str" optional>
The OpenAI API key to use for the evaluation. Defaults to `None`. If not provided, we will use the `OPENAI_API_KEY` environment variable.
</ParamField>

<ParamField path="answer_claims_prompt" type="str" optional>
The prompt to extract the claims from the provided answer. Defaults to `GROUNDEDNESS_ANSWER_CLAIMS_PROMPT`, which can be found at the `embedchain.config.eval.base` path.
</ParamField>

<ParamField path="claims_inference_prompt" type="str" optional>
The prompt to get verdicts on the claims from the answer, given the context. Defaults to `GROUNDEDNESS_CLAIMS_INFERENCE_PROMPT`, which can be found at the `embedchain.config.eval.base` path.
</ParamField>

```python
# Import path assumed; see the prompt locations noted above.
from embedchain.config.eval.base import GroundednessConfig

openai_api_key = "sk-xxx"

metric = Groundedness(config=GroundednessConfig(model="gpt-4", api_key=openai_api_key))
print(metric.evaluate(dataset))
```
## Custom <a id="custom"></a>

You can also create your own evaluation metric by extending the `BaseMetric` class. You can find the source code for the existing metrics at the `embedchain.eval.metrics` path.

<Note>
You must provide the `name` of your custom metric in the `__init__` method of your class. This name will be used to identify your metric in the evaluation report.
</Note>

```python
from typing import Optional

from embedchain.config.base_config import BaseConfig
from embedchain.eval.metrics import BaseMetric
from embedchain.utils.eval import EvalData


class CustomMetric(BaseMetric):
    def __init__(self, config: Optional[BaseConfig] = None):
        super().__init__(name="custom_metric")

    def evaluate(self, dataset: list[EvalData]):
        score = 0.0
        # write your evaluation logic here
        return score
```
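
You can then run your custom metric on the same dataset as the built-in metrics, for example:

```python
metric = CustomMetric()
print(metric.evaluate(dataset))  # dataset from above
```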