A major challenge in screen-camera visual multiple-input multiple-output (MIMO) communications is increasing the achievable throughput by reducing nonlinear channel effects, including perspective distortion, ambient light, and color mixing. To mitigate these effects, existing transmission methods apply linear or simple nonlinear equalization during decoding. However, the throughput improvement from such equalization is often limited because the channel combines several nonlinear distortions. In addition, existing studies consider specific environments, such as indoor and static communications, although screen-camera communications can serve a variety of applications, including outdoor and mobile scenarios. In this study, we propose 1) deep neural network (DNN)-based decoding for screen-camera communications to increase the achievable throughput and 2) a Unity 3D-based evaluation methodology that trains the DNN on synthetic images so that it is robust across many different screen-camera environments. The DNN learns nonlinear kernels for equalization from a large number of captured images and then decodes the original bits from newly captured images using the trained kernels. Because the Unity-based environment allows programmable screens, cameras, and ambient lights to be placed freely in 3D space, the evaluation tool can easily capture numerous photo-realistic images in different screen-camera scenarios to learn the impact of perspective distortion, screen-to-camera distance, motion blur, and ambient light on the throughput. As an initial proof of concept, we demonstrate that the proposed DNN-based decoder improves the achievable throughput by up to 148% compared to existing methods by equalizing nonlinear effects.
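The train-then-decode idea can be illustrated with a minimal sketch. It rests on strong assumptions and is not the architecture or the Unity pipeline used in this study: the channel is a toy model (neighbor crosstalk, a gamma curve, an ambient offset, and Gaussian noise), the decoder is a one-hidden-layer network written in plain NumPy, and every name and parameter value (`simulate_channel`, `train_decoder`, `decode`, `gamma`, `ambient`, `hidden`) is hypothetical.

```python
import numpy as np


def simulate_channel(bits, rng, gamma=2.2, ambient=0.1, noise=0.02):
    """Toy nonlinear screen-camera channel (an assumption, not the paper's model).

    bits: (n, k) array of 0/1 values, one column per screen block.
    Returns (n, k) received intensities.
    """
    k = bits.shape[1]
    # Neighboring blocks leak into each other, a crude stand-in for blur/mixing.
    mix = 0.8 * np.eye(k) + 0.1 * (np.eye(k, k=1) + np.eye(k, k=-1))
    lit = np.clip(bits @ mix.T, 0.0, 1.0)
    # Gamma curve plus ambient offset make the channel nonlinear.
    return lit ** gamma + ambient + noise * rng.standard_normal(bits.shape)


def train_decoder(x, y, hidden=16, epochs=500, lr=0.5, seed=0):
    """Fit a one-hidden-layer network with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    w1 = 0.5 * rng.standard_normal((x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = 0.5 * rng.standard_normal((hidden, y.shape[1]))
    b2 = np.zeros(y.shape[1])
    for _ in range(epochs):
        h = np.tanh(x @ w1 + b1)                    # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))    # per-bit probabilities
        g_out = (p - y) / len(x)                    # cross-entropy gradient w.r.t. logits
        g_h = (g_out @ w2.T) * (1.0 - h ** 2)       # backpropagate through tanh
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (x.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return w1, b1, w2, b2


def decode(params, x):
    """Hard-decide bits from received intensities with the trained network."""
    w1, b1, w2, b2 = params
    h = np.tanh(x @ w1 + b1)
    return ((h @ w2 + b2) > 0).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    train_bits = rng.integers(0, 2, size=(2000, 8))
    test_bits = rng.integers(0, 2, size=(500, 8))
    params = train_decoder(simulate_channel(train_bits, rng), train_bits.astype(float))
    decoded = decode(params, simulate_channel(test_bits, rng))
    print("bit accuracy on unseen captures:", (decoded == test_bits).mean())
```

The sketch trains on one set of simulated captures and decodes a separate, unseen set, mirroring the split between training images and newly captured images. A realistic decoder would operate on image patches and would likely use convolutional layers.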