Abstract: Improving the generalization performance of deep neural networks (DNNs) trained by minibatch stochastic gradient descent (SGD) has drawn considerable attention from deep learning practitioners.