Glossary

Speech Recognition

  • Input: a speech signal sequence

  • Output: a character string

  • Task: find the string with the highest probability for the given input signal sequence
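The search described in the last bullet is commonly written as a decision rule. The symbols below are notation introduced here for illustration and are not part of the original glossary: X stands for the input signal sequence and W for a candidate string.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} P(X \mid W)\, P(W)
```

The second form follows from Bayes' rule, since P(X) does not depend on W. Under this reading, P(X | W) is the part usually attributed to the acoustic model and P(W) the part attributed to the language model.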

AM(Acoustic Model)

This is the part that teaches the computer sounds. The sound '가다' is split into phoneme units, 'ㄱ+ㅏ+ㄷ+ㅏ', and each phoneme is matched to the sound it actually corresponds to. This lets the computer tell which phoneme a given sound is. It is like a person learning a language: after hearing the new language often enough, its sounds become familiar.

  • Korean term: 음향 모델

  • Used to find which word sequence is most probable for the input to the speech recognizer

  • A model trained on the distribution of phonemes (units of pronunciation) for each input signal

LM(Language Model)

This is what makes it possible to recognize real language made up of words, that is, speech as people actually use it. The method used here is the N-gram statistical language model: a large number of commonly used sentences are analyzed to record, as probabilities, which words occur frequently and which words tend to come before or after a given word. It is called n-gram because these neighboring-word dependencies are examined over up to n words. Once this probability model is in place, sound can in theory be mapped to phonemes, phonemes assembled into words, and words into sentences, so that the sentence most probably matching what the person actually said can be found.

  • Korean term: 언어 모델

  • Complements the acoustic model by finding which word sequence is most plausible in context
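As a rough illustration of the counting idea behind an n-gram model, the sketch below estimates bigram (n = 2) probabilities from a toy corpus. The corpus, the function names, and the `<s>` / `</s>` boundary tokens are assumptions made for this example; a production language model would also apply smoothing, which is omitted here.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams


def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


# Hypothetical toy corpus, used only to show the mechanics.
corpus = ["i like tea", "i like coffee", "you like tea"]
unigrams, bigrams = train_bigram(corpus)
p_tea = bigram_prob(unigrams, bigrams, "like", "tea")
```

Here `p_tea` is the share of occurrences of "like" that are followed by "tea" in the toy corpus.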

# The scores below are illustrative values only.

Actual utterance)
이래라 저래라 하지마 ("Don't tell me to do this or that")

Acoustic model result)
이래라 저래라 하지마 - 0.7 score
일해라 절해라 하지마 - 0.8 score ( a near-homophone, literally "Don't tell me to work or to bow"; it can score high in the acoustic model when only the phoneme distribution is considered )
...

Language model result)
이래라 저래라 하지마 - 0.9 score
일해라 절해라 하지마 - 0.2 score ( low probability because '일해라' and '절해라' are weakly related in context )
...

Final recognizer result)
이래라 저래라 하지마 - AM score x LM score = 0.7 x 0.9 = 0.63 ( higher probability, so this is the final recognition result )
일해라 절해라 하지마 - AM score x LM score = 0.8 x 0.2 = 0.16
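The arithmetic in the example above can be sketched in a few lines. The scores are the illustrative values from the example, and the dictionary layout and the `combine` helper are assumptions made here. Practical decoders usually combine log-probabilities with a weighting factor rather than multiplying raw scores.

```python
# Illustrative scores copied from the example above; not real model output.
am_scores = {"이래라 저래라 하지마": 0.7, "일해라 절해라 하지마": 0.8}
lm_scores = {"이래라 저래라 하지마": 0.9, "일해라 절해라 하지마": 0.2}


def combine(am, lm):
    """Multiply the AM and LM score of each hypothesis."""
    return {hyp: am[hyp] * lm[hyp] for hyp in am}


combined = combine(am_scores, lm_scores)
# The hypothesis with the highest combined score is the final result.
best = max(combined, key=combined.get)
```

Although the second hypothesis has the higher acoustic score, the language model score pulls its combined score down, so `best` is the first hypothesis.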

Top Graph

A graph holding general-purpose language model information (needs further explanation)

Sub Graph

  • A graph holding language model information for a specific domain (finance, broadcasting, sports, etc.)

  • Can be used in addition to the Top Graph

Transfer Learning

A method that retrains the final layer of a trained acoustic model on data from the target domain

Active Learning

A method that scores continuously arriving data, extracts the samples that decode poorly (that is, samples the model is under-trained on), and retrains the acoustic model on them

Threshold (Low risk, High risk)

  • A critical point, boundary, or cutoff value

  • In Active Learning, the value used as the criterion for extracting samples that need training
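A minimal sketch of threshold-based sample selection, assuming each sample already carries a single score where lower means worse decoding. The sample IDs, the score scale, the threshold value, and the function name are all hypothetical. How the Low risk and High risk thresholds are defined in the actual system is not stated here, so only one cutoff is shown.

```python
def select_for_retraining(scored_samples, threshold):
    """Return the IDs of samples whose score falls below the threshold."""
    return [sample_id for sample_id, score in scored_samples if score < threshold]


# Hypothetical (sample_id, score) pairs; lower score = worse decoding.
scored = [("utt1", 0.95), ("utt2", 0.40), ("utt3", 0.72), ("utt4", 0.10)]
to_retrain = select_for_retraining(scored, threshold=0.5)
```

With these values, `utt2` and `utt4` fall below the cutoff and would be sent back for retraining.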

CER (Character Error Rate)

Speech recognition performance can only be evaluated by comparing the recognition result against the source text of the utterance (the original script that was read aloud, or a transcript written by a person listening to the audio).

English puts spaces between words and the word is clearly defined, so it uses WER (Word Error Rate). In Korean, however, the spacing unit is the eojeol (어절), and a word is smaller than an eojeol, so the recognition rate of a Korean speech recognition engine is evaluated with the character error rate, CER (Character Error Rate).

How the recognition rate is calculated

  1. CER(%) = 100 x (deletions + substitutions + insertions) / number of characters in the original text

  2. Accuracy(%) = 100 x [1 - (deletions + substitutions + insertions) / number of characters in the original text]

  3. The error counts are obtained with the edit distance (Levenshtein distance) algorithm
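The formulas above can be sketched as follows. This is an illustrative implementation, not the scoring script's own code. Stripping spaces before comparing characters is an assumption made here; other text normalization is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    ref_chars = ref.replace(" ", "")  # assumption: spaces are not scored
    hyp_chars = hyp.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def accuracy(ref, hyp):
    """Accuracy in percent, defined as 100 - CER."""
    return 100.0 - cer(ref, hyp)


example_cer = cer("이래라 저래라 하지마", "일해라 절해라 하지마")
```

For the two example sentences, four of the nine reference characters differ, so `example_cer` is 4/9 expressed as a percentage.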

See the script steps/scoring/score_kaldi_cer.sh in case you need to evaluate CER.

Reference & Hypothesis

Reference is the original sentence or the transcribed ground truth; Hypothesis is the result sentence decoded by the speech recognizer.

Pre-training

Training a model on a large-scale unlabeled dataset.

Fine-tuning

Retraining the model to fit a small labeled dataset.
