Open-domain conversational agents are becoming widespread. However, training such agents to generate high-quality responses remains a major challenge, because response quality depends on many factors. Recent methods train agents directly on gold responses from the training set, but they have been shown to generate low-quality responses at evaluation time. In this thesis, we propose to use a deep preference learning method to train a function that quantifies the quality of generated responses. We then use this function as a reward estimator in a reinforcement learning model to train agents.
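
To make the idea of learning a quality function from preferences concrete, the following is a minimal sketch and not the model used in this thesis. It assumes a pairwise (Bradley-Terry style) logistic loss and replaces the deep network with a linear scorer over hand-made features. The names `preference_loss`, `train_reward`, and `reward`, the two toy features, and the hyperparameters are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred response wins.

    Assumes a Bradley-Terry style model: P(win) = sigmoid(score difference).
    """
    return -math.log(sigmoid(score_preferred - score_rejected))


def reward(w, features):
    """Linear stand-in for the learned quality function (assumption)."""
    return sum(wi * fi for wi, fi in zip(w, features))


def train_reward(pairs, dim, lr=0.1, epochs=200):
    """Fit w by gradient descent so preferred responses score higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = reward(w, preferred) - reward(w, rejected)
            # d/dw of -log(sigmoid(diff)) is -sigmoid(-diff) * (preferred - rejected)
            g = sigmoid(-diff)
            w = [wi + lr * g * (p - r) for wi, p, r in zip(w, preferred, rejected)]
    return w


# Hypothetical features per response: [relevance, repetition].
toy_pairs = [
    ([0.9, 0.1], [0.2, 0.8]),  # (preferred, rejected)
    ([0.7, 0.2], [0.3, 0.6]),
]
w = train_reward(toy_pairs, dim=2)
print(reward(w, [0.9, 0.1]) > reward(w, [0.2, 0.8]))
```

In a reinforcement learning setup, the scalar returned by `reward` would play the role of the reward signal for a generated response; the policy update itself is not shown here.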