This is the third post in this series about distilling BERT with multimetric Bayesian optimization. Part 1 discusses the background for the experiment, and Part 2 discusses the setup for the Bayesian optimization. In my previous posts, I discussed the importance of BERT for transfer learning in NLP and established the foundations of this experiment's design. In this post, I discuss the model…