Multi-Query Attention
Multi-Query Attention (MQA) is a variant of the attention mechanism used in Transformer models, most prominently in large language models for natural language processing. Its main purpose is to make inference more efficient, especially token-by-token (autoregressive) decoding, by reducing the memory and memory bandwidth spent on keys and values when many query heads are processed together.
In standard multi-head attention, each head has its own query, key, and value projections. The model must therefore compute, store, and (during decoding) repeatedly read a separate set of keys and values for every head, which becomes expensive with many heads and long sequences. Multi-Query Attention addresses this by letting all query heads share a single set of keys and values. This shrinks the key-value cache and reduces the memory bandwidth needed to read it at each decoding step.
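To make the savings concrete, the sketch below estimates the key-value cache size for a hypothetical decoder. The helper name `kv_cache_bytes` and the configuration (32 layers, 32 heads, head size 128, 4,096 cached tokens, 2-byte values) are illustrative assumptions, not taken from any particular model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values during autoregressive decoding.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
N_LAYERS, N_HEADS, D_HEAD, SEQ_LEN = 32, 32, 128, 4096

mha = kv_cache_bytes(N_LAYERS, N_HEADS, D_HEAD, SEQ_LEN)  # one K/V head per query head
mqa = kv_cache_bytes(N_LAYERS, 1, D_HEAD, SEQ_LEN)        # one K/V head shared by all

print(f"MHA: {mha / 2**20:.0f} MiB, MQA: {mqa / 2**20:.0f} MiB, ratio {mha // mqa}x")
# prints "MHA: 2048 MiB, MQA: 64 MiB, ratio 32x"
```

With one shared key/value head instead of 32, the cache shrinks by a factor equal to the number of query heads. The number of query-key dot products is unchanged, so the gain comes mainly from memory and bandwidth rather than from arithmetic.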
The architecture of MQA differs from multi-head attention in one main respect: keys and values come from a single projection that is computed once per token and shared by every query head, while each head keeps its own query projection. This removes the redundancy of computing and storing a separate set of keys and values for each head. As a result, MQA can decode substantially faster, with only a modest loss in quality in reported results, which makes it particularly valuable for long inputs, large batches, and latency-sensitive, real-time applications.
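A minimal pure-Python sketch of the shared key/value idea follows. The function names (`matmul`, `softmax`, `multi_query_attention`), the causal mask, and the random weights are assumptions for illustration, and the output projection that usually concatenates and mixes the heads is omitted. Each head projects its own queries, while `w_k` and `w_v` are applied once and reused by every head.

```python
import math
import random


def matmul(a, b):
    """Multiply matrices stored as lists of rows: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_attention(x, w_q_heads, w_k, w_v, causal=True):
    """x: seq_len x d_model. w_q_heads: one d_model x d_head matrix per head.
    w_k, w_v: a single d_model x d_head matrix each, shared by all heads.
    Returns one seq_len x d_head output per head (no output projection)."""
    k = matmul(x, w_k)  # keys computed once, reused by every head
    v = matmul(x, w_v)  # values computed once, reused by every head
    d_head = len(w_k[0])
    scale = 1.0 / math.sqrt(d_head)
    outputs = []
    for w_q in w_q_heads:
        q = matmul(x, w_q)  # each head has its own queries
        head_out = []
        for i, q_i in enumerate(q):
            n_visible = i + 1 if causal else len(k)  # causal mask: positions <= i
            scores = [scale * sum(a * b for a, b in zip(q_i, k[j]))
                      for j in range(n_visible)]
            weights = softmax(scores)
            head_out.append([sum(weights[j] * v[j][c] for j in range(n_visible))
                             for c in range(d_head)])
        outputs.append(head_out)
    return outputs


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_head, n_heads, seq_len = 8, 4, 4, 5
x = rand_matrix(seq_len, d_model)
w_q_heads = [rand_matrix(d_model, d_head) for _ in range(n_heads)]
w_k = rand_matrix(d_model, d_head)  # the only key projection
w_v = rand_matrix(d_model, d_head)  # the only value projection

out = multi_query_attention(x, w_q_heads, w_k, w_v)
print(len(out), len(out[0]), len(out[0][0]))  # 4 5 4
```

Under the causal mask the first position can only attend to itself, so every head's output there equals the first value vector, which makes a quick sanity check. Only one key matrix and one value matrix exist no matter how many heads are added, which is exactly what keeps the cache small.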
Multi-Query Attention has been evaluated on tasks such as machine translation and adopted in large language models such as PaLM. Because it lowers the cost of each decoding step, these systems can generate responses faster and serve more requests on the same hardware, which matters wherever latency and efficiency are paramount.