code_parser
langroid/parsing/code_parser.py
CodeParser(config)
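A minimal construction sketch. `CodeParsingConfig` is assumed to be the companion config class defined in the same module, and the field names shown (`extensions`, `chunk_size`) are assumptions about its schema:

```python
from langroid.parsing.code_parser import CodeParser, CodeParsingConfig

# `extensions` and `chunk_size` are assumptions about the config schema;
# adjust to whatever fields CodeParsingConfig actually defines.
config = CodeParsingConfig(
    extensions=["py", "yml"],  # only these languages will be split
    chunk_size=500,            # target token budget per chunk (assumed field)
)
parser = CodeParser(config)
```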
num_tokens(text)
How many tokens are in the text, according to the tokenizer. This needs to be accurate; otherwise we may exceed the maximum number of tokens allowed by the model.

Args:
    text: string to tokenize

Returns:
    number of tokens in the text
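The usual way to make this count accurate is to run the model's own tokenizer rather than estimating from characters. A sketch of that idea using `tiktoken`; which tokenizer this class actually uses depends on the configured model, so treat the encoding choice as an assumption:

```python
import tiktoken

# cl100k_base is a common OpenAI encoding; it is a stand-in here, not
# necessarily the encoding this method uses.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Exact token count: encode the text and measure the result."""
    return len(encoding.encode(text))

print(count_tokens("def add(a, b):\n    return a + b"))
```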
split(docs)
Split the documents into chunks, according to the config.splitter. Only documents whose language is among config.extensions are split.

Note:
    We assume the metadata in each document has at least a `language` field, which is used to determine how to chunk the code.

Args:
    docs: list of documents to split

Returns:
    list of documents, where each document is a chunk; the metadata of the original document is duplicated for each chunk, so that when we retrieve a chunk, we immediately know info about the original document.
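A hypothetical usage sketch, reusing `parser` from the construction sketch above. It assumes `Document` and `DocMetaData` live in `langroid.mytypes` and that `DocMetaData` tolerates the extra `language` field the splitter needs:

```python
from langroid.mytypes import DocMetaData, Document

# `language` is passed as an extra metadata field; DocMetaData is assumed
# to accept fields beyond its declared schema.
doc = Document(
    content="def add(a, b):\n    return a + b\n",
    metadata=DocMetaData(source="example.py", language="py"),
)

chunks = parser.split([doc])
for chunk in chunks:
    # Each chunk duplicates the original document's metadata.
    print(chunk.metadata.language, repr(chunk.content[:40]))
```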
chunk_code(code, language, max_tokens, len_fn)
Chunk code into smaller pieces, so that we don't exceed the maximum number of tokens allowed by the embedding model.

Args:
    code: string of code
    language: str as a file extension, e.g. "py", "yml"
    max_tokens: max tokens per chunk
    len_fn: function to get the length of a string in token units

Returns:
    list of code chunks
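Because the length function is a parameter, any token counter works. A sketch using a crude whitespace-split count as a stand-in for a real tokenizer (the keyword names follow the signature above; the list-of-strings return type is assumed):

```python
from langroid.parsing.code_parser import chunk_code

# Build a code string long enough to need several chunks.
code = "\n".join(f"def f{i}(x):\n    return x + {i}\n" for i in range(20))

# Whitespace-split is a rough length proxy; in practice pass a real
# tokenizer-backed counter like the count_tokens sketch above.
chunks = chunk_code(code, language="py", max_tokens=30, len_fn=lambda s: len(s.split()))

for c in chunks:
    print(len(c.split()), repr(c[:30]))
```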