Autonomous robotic systems need a flexible and safe way to interact with their surroundings. When encountering unfamiliar objects, agents should be able to identify and learn the relevant affordances in order to apply appropriate actions. Focusing on affordance learning, we introduce a neuro-symbolic AI system with a robot simulation, capable of inferring appropriate actions. The system's core is a visuo-lingual attribute detection module coupled with a probabilistic knowledge base; it is accompanied by a Unity robot simulation used for evaluation. The system's caption-inference capability is evaluated with image-captioning and machine-translation metrics on a case study of opening containers. The two main affordance-action relations are jar/bottle lids that are opened with either a ‘twist’ or a ‘snap’ action. The results show that the system successfully opens all 50 containers in the test case, with an attribute-captioning accuracy of 71%; the mismatch is likely because ‘snap’ lids can also be opened by a twisting motion. Our system demonstrates that affordance inference can be implemented successfully with a cognitive visuo-lingual method that could be generalized to other affordance cases.
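To make the attribute-to-action pipeline concrete, the following is a minimal illustrative sketch, not the paper's implementation: a toy probabilistic knowledge base that maps lid attributes (as might be parsed from a generated caption) to an open action. All attribute names, probability values, and function names below are assumptions for illustration only.

```python
# Hypothetical sketch of affordance inference over a probabilistic knowledge base.
# Attribute names and probabilities are illustrative, not taken from the paper.

from collections import defaultdict

# Hypothetical conditional probabilities P(action | attribute).
KNOWLEDGE_BASE = {
    "threaded_lid": {"twist": 0.9, "snap": 0.1},
    "flip_cap":     {"twist": 0.2, "snap": 0.8},
    "metal_lid":    {"twist": 0.7, "snap": 0.3},
}

def infer_open_action(detected_attributes):
    """Pool per-attribute evidence and return the most probable open action.

    `detected_attributes` is a list of attribute strings produced by a
    visuo-lingual detector (e.g. parsed from a generated caption).
    """
    scores = defaultdict(float)
    for attr in detected_attributes:
        for action, prob in KNOWLEDGE_BASE.get(attr, {}).items():
            scores[action] += prob  # naive additive evidence pooling

    if not scores:
        return None  # unknown object: no affordance evidence available
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Example: a caption mentioning a threaded metal lid favours 'twist'.
    print(infer_open_action(["threaded_lid", "metal_lid"]))  # -> 'twist'
```

In such a sketch, the inferred action label would then be handed to the robot simulation, which executes the corresponding ‘twist’ or ‘snap’ motion on the container.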