Tokenizer for Swift.
A GPT2 tokenizer extracted from swift-coreml-transformers and lightly modified so it can be used as a standalone package. (It is currently used specifically for BART.)
Add the following to the dependencies in your Package.swift:
dependencies: [
    .package(url: "https://github.com/RayKitajima/SwiftTokenizer.git", from: "1.0.0"),
],
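For context, here is a sketch of how that dependency entry might sit inside a complete manifest. The package and target name `MyApp` and the `Resources` directory are hypothetical, and the product name `SwiftTokenizer` is an assumption inferred from the `import SwiftTokenizer` statement, not something confirmed by the package's own manifest. The `resources` entry is included because SwiftPM only synthesizes `Bundle.module` for targets that declare resources.

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical package name
    dependencies: [
        .package(url: "https://github.com/RayKitajima/SwiftTokenizer.git", from: "1.0.0"),
    ],
    targets: [
        .executableTarget(
            name: "MyApp", // hypothetical target name
            dependencies: [
                // Product name assumed to match the module name used in `import SwiftTokenizer`.
                .product(name: "SwiftTokenizer", package: "SwiftTokenizer"),
            ],
            // Hypothetical directory holding vocab.json and merges.txt,
            // so that Bundle.module is generated for this target.
            resources: [.process("Resources")]
        ),
    ]
)
```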
Usage:
import SwiftTokenizer
let config = TokenizerConfig(
    vocab: Bundle.module.url(forResource: "vocab", withExtension: "json")!,
    merges: Bundle.module.url(forResource: "merges", withExtension: "txt")!
)
let tokenizer = Tokenizer(config: config)
let tokens = tokenizer.encode(text: "Hello, world!")
print(tokens)
// [31414, 6, 232, 328]
let decoded = tokenizer.decode(tokens: tokenizer.stripBOS(tokens: tokenizer.stripEOS(tokens: tokens)))
print(decoded)
// Hello, world!
For BART, you need to add the BOS and EOS tokens to input_ids.
let input_ids = tokenizer.appendEOS(tokens: tokenizer.appendBOS(tokens: tokenizer.encode(text: "Hello, world!")))
print(input_ids)
// [0, 31414, 6, 232, 328, 2]
// inverse: strip EOS and BOS, then decode
let decoded = tokenizer.decode(tokens: tokenizer.stripBOS(tokens: tokenizer.stripEOS(tokens: input_ids)))
print(decoded)
// Hello, world!
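To make the BOS/EOS handling concrete, the following standalone sketch shows the wrapping and stripping on plain token ID arrays. It does not use the package. The helpers `wrapWithSpecialTokens` and `stripSpecialTokens` are hypothetical, and the IDs 0 (BOS) and 2 (EOS) are assumptions read off the example output above. The package's own `appendBOS`, `appendEOS`, `stripBOS`, and `stripEOS` may be implemented differently.

```swift
// Assumed special-token IDs, taken from the example output above ([0, ..., 2]).
let bosID = 0
let eosID = 2

// Hypothetical helper: prepend BOS and append EOS to a token sequence.
func wrapWithSpecialTokens(_ tokens: [Int]) -> [Int] {
    return [bosID] + tokens + [eosID]
}

// Hypothetical helper: drop a leading BOS and a trailing EOS if present.
func stripSpecialTokens(_ tokens: [Int]) -> [Int] {
    var result = tokens
    if result.first == bosID { result.removeFirst() }
    if result.last == eosID { result.removeLast() }
    return result
}

let tokens = [31414, 6, 232, 328]
let inputIDs = wrapWithSpecialTokens(tokens)
print(inputIDs)
// [0, 31414, 6, 232, 328, 2]
print(stripSpecialTokens(inputIDs) == tokens)
// true
```

Stripping only when the marker is actually present keeps the inverse safe to call on sequences that were never wrapped, which matches how the basic example above strips tokens that carry no BOS or EOS.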
https://github.com/huggingface/swift-coreml-transformers
Apache License 2.0
Copyright © 2019 Hugging Face. All rights reserved.
Modifications copyright (C) 2023 Rei Kitajima (rei.kitajima@gmail.com)