```shell
$ git clone https://github.com/ggerganov/whisper.cpp
$ make
cc  -I.              -O3 -std=c11   -pthread -mfma -mf16c -mavx -mavx2 -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp ggml.o whisper.o -o main
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,       --help           show this help message and exit
  -s SEED,  --seed SEED      RNG seed (default: -1)
  -t N,     --threads N      number of threads to use during computation (default: 4)
  -p N,     --processors N   number of processors to use during computation (default: 1)
  -ot N,    --offset-t N     time offset in milliseconds (default: 0)
  -on N,    --offset-n N     segment index offset (default: 0)
  -d N,     --duration N     duration of audio to process in milliseconds (default: 0)
  -mc N,    --max-context N  maximum number of text context tokens to store (default: max)
  -ml N,    --max-len N      maximum segment length in characters (default: 0)
  -wt N,    --word-thold N   word timestamp probability threshold (default: 0.010000)
  -su,      --speed-up       speed up audio by factor of 2 (faster processing, reduced accuracy, default: false)
  -v,       --verbose        verbose output
            --translate      translate from source language to english
  -otxt,    --output-txt     output result in a text file
  -ovtt,    --output-vtt     output result in a vtt file
  -osrt,    --output-srt     output result in a srt file
  -owts,    --output-words   output script for generating karaoke video
  -ps,      --print_special  print special tokens
  -pc,      --print_colors   print colors
  -nt,      --no_timestamps  do not print timestamps
  -l LANG,  --language LANG  spoken language (default: en)
  -m FNAME, --model FNAME    model path (default: models/ggml-base.en.bin)
  -f FNAME, --file FNAME     input WAV file path
```
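The `-f` option takes a WAV file. The sample transcribed below is 176000 samples over 11.0 seconds, i.e. 16 kHz audio; assuming `main` expects 16-bit mono PCM at that rate (an assumption here, not something the transcript states), a minimal Python sketch can pre-check a file before passing it in. The function name `check_wav` is ours, not part of whisper.cpp:

```python
import wave

def check_wav(path):
    """Return True if the WAV file looks usable as input:
    16-bit PCM, mono, 16 kHz (assumed requirements)."""
    with wave.open(path, "rb") as w:
        return (w.getsampwidth() == 2      # 2 bytes = 16-bit samples
                and w.getnchannels() == 1  # mono
                and w.getframerate() == 16000)
```

A file that fails this check can usually be converted with a resampling tool before transcription.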
```shell
$ bash ./models/download-ggml-model.sh base.en

Downloading ggml model base.en from https://huggingface.co/datasets/ggerganov/whisper.cpp ...
ggml-base.en.bin    100%[=====================================================================================================================>] 141.11M   434KB/s  in 3m 56s

Done! Model base.en saved in models/ggml-base.en.bin
You can now use it like this:

  $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav

$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from models/ggml-base.en.bin
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size   =  22.83 MB
whisper_model_load: model size    = 140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing samples/jfk.wav (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]  And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =   542.31 ms
whisper_print_timings:      mel time =   174.94 ms
whisper_print_timings:   sample time =    14.82 ms
whisper_print_timings:   encode time =  6282.06 ms / 1047.01 ms per layer
whisper_print_timings:   decode time =   373.59 ms /   62.27 ms per layer
whisper_print_timings:    total time =  7390.82 ms
```
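The figures in the log above are internally consistent, which is a quick way to read them: 176000 samples over 11.0 seconds means 16 kHz input, and the "per layer" encode time is simply the encode total divided by `n_audio_layer`. A small Python check, with the constants copied from the log:

```python
# Constants taken verbatim from the transcript above.
SAMPLES = 176_000      # "176000 samples, 11.0 sec"
SECONDS = 11.0
ENCODE_MS = 6282.06    # whisper_print_timings: encode time
N_AUDIO_LAYER = 6      # whisper_model_load: n_audio_layer

# 176000 samples over 11.0 s -> 16 kHz sample rate.
sample_rate = SAMPLES / SECONDS            # 16000.0

# The per-layer figure is the total divided by the layer count.
encode_per_layer = ENCODE_MS / N_AUDIO_LAYER

print(sample_rate, round(encode_per_layer, 2))
```

This also explains why the encoder dominates the 7390.82 ms total: six layers at roughly a second each.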