K-shingles, also called K-grams or K-mers, are a technique used in natural language processing and text analysis. They are contiguous sequences of K items (typically characters or words) extracted from a larger text document. Their primary purpose is to capture the local structure of text so that documents can be compared and their similarity measured.
For instance, take a document consisting of the string “hello world” and choose K=3. The resulting set of character 3-shingles is: “hel”, “ell”, “llo”, “lo ”, “o w”, “ wo”, “wor”, “orl”, “rld”. Note that the space counts as a character, so an 11-character string yields 11 - 3 + 1 = 9 shingles. Breaking the text into these overlapping segments helps identify patterns, phrases, and similarities between texts.
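The extraction step in the example above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `char_shingles` is hypothetical, and it assumes character-level shingles with whitespace kept as-is.

```python
def char_shingles(text: str, k: int) -> list[str]:
    """Return the overlapping character k-shingles of text, in order.

    Hypothetical helper for illustration. Whitespace is treated like any
    other character, and a text shorter than k yields an empty list.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Slide a window of width k across the text, one character at a time.
    return [text[i:i + k] for i in range(len(text) - k + 1)]


shingles = char_shingles("hello world", 3)
print(shingles)
# ['hel', 'ell', 'llo', 'lo ', 'o w', ' wo', 'wor', 'orl', 'rld']
```

Wrapping the result in `set(...)` gives the set of distinct shingles, which is the form usually used for similarity comparisons. Word-level shingles could be produced the same way by applying the sliding window to `text.split()` instead of to the raw string.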
K-shingles are particularly useful in applications such as plagiarism detection, document clustering, and duplicate web page detection. Once documents are converted into sets of K-shingles, measures such as Jaccard similarity (computed on the sets) or cosine similarity (computed on shingle-count vectors) can quantify how similar two texts are. The choice of K is crucial: with a small K, most shingles occur in most documents, so unrelated texts look similar and comparisons become less meaningful; a larger K captures more context, but it may miss finer details, since a single small edit changes every shingle that overlaps it.
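To make the comparison step concrete, here is a small Python sketch of Jaccard similarity over character shingle sets, defined as the size of the intersection divided by the size of the union. The helper names `shingle_set` and `jaccard` are hypothetical, and treating two empty sets as identical (similarity 1.0) is an assumption made here to avoid division by zero.

```python
def shingle_set(text: str, k: int) -> set[str]:
    """Set of distinct character k-shingles of text (illustrative helper)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|.

    Assumption: two empty sets are treated as identical (1.0).
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "hello world"
doc2 = "hello there"

# K=3: the texts share 4 of 14 distinct shingles.
print(jaccard(shingle_set(doc1, 3), shingle_set(doc2, 3)))  # 4/14, about 0.286

# K=1: single characters overlap heavily, so the score is much higher.
print(jaccard(shingle_set(doc1, 1), shingle_set(doc2, 1)))  # 6/9, about 0.667
```

The two calls illustrate the effect of K on the same pair of strings: with K=1 the score is inflated by characters that almost any English text shares, while K=3 reflects only the common prefix “hello ”.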
Overall, K-shingles are a powerful tool in text mining and machine learning, enabling better understanding and processing of large volumes of textual data.