Abstract
Language models have proven effective in a variety of software applications, particularly in tasks related to workflow automation. These models possess the crucial ability to call functions, which is essential for building AI agents. Although large-scale language models perform well in cloud environments, they raise concerns over privacy and cost. Current on-device models for function calling, meanwhile, suffer from high latency and low accuracy. Our research presents a new method that enables an on-device model with 2 billion parameters to surpass GPT-4 in both accuracy and latency while reducing the context length by 95%. Compared to Llama-7B with a RAG-based function calling mechanism, our method improves latency 35-fold. This brings latency down to levels suitable for deployment across a variety of edge devices in production environments, in line with the performance requirements of real-world applications.