
How to use 3D Vision AI for automation in agriculture


Let’s take a deep dive into 3D Vision AI techniques for agriculture automation. In the world of Vision AI, the extra layer of depth provided by 3D vision is indispensable for efficient automation of agricultural processes. But how does the technology actually work? 

Agricultural challenges

Agriculture, one of the oldest professions globally, relies heavily on human vision for tasks like planting, cutting, pruning, and harvesting. However, the quality of labour and the shortage of workers affect the productivity, profitability and scalability of food production. The urgency of these challenges only amplifies in the face of a growing world population and climate change leading to water shortages.

Why 3D Vision AI in agriculture?

To overcome these challenges, tech companies are paving the way for automation and robotics in agriculture and horticulture, even for the most complex crop work activities. When it comes to intricate manipulations requiring depth perception and complex spatial understanding, the undeniable star of the show is 3D Vision AI.

Consider a scenario where a robot arm needs to discern where to cut a plant stem. Traditional 2D deep learning can distinguish leaves from branches, but for that precise cut at just the right angle, the third dimension is non-negotiable. Depending on the case, you choose a specific type of camera and AI software. In general, the more complex and variable the shapes you’re working with, and the more precise the actions your robotics have to perform, the more sophisticated the Vision AI set-up will be.

Create a 3D image with depth sensors

The first category of 3D vision technologies merges the detailed colour capture capabilities of an RGB camera with the spatial depth information obtained from a depth sensor to construct a three-dimensional depiction of reality, known as an RGB-D image. This integration is achieved by correlating the pixel coordinates between the two sensor types, thereby enriching the colour data (RGB) with depth perception (D).

Among the depth sensing technologies, Time of Flight (ToF) and LIDAR sensors are predominant. These sensors operate by emitting infrared pulses and then measuring the elapsed time before these pulses are reflected back from the scene. The principle here is straightforward: the longer the reflection time, the greater the distance of the object from the sensor. Despite their utility, a notable limitation of ToF and LIDAR sensors, particularly in agricultural contexts, lies in their resolution and depth accuracy. The pixel resolution and the precision in depth measurement offered by these sensors often fall short of the requirements for detailed agricultural analysis, where both are crucial for tasks such as crop health assessment and precision farming operations.
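To make the RGB-D idea concrete, here is a minimal sketch in Python with NumPy of how an aligned depth map can be back-projected into a coloured point cloud. The pinhole intrinsics (fx, fy, cx, cy) are illustrative placeholders; real values come from the sensor’s calibration.

```python
import numpy as np

def rgbd_to_point_cloud(rgb, depth, fx, fy, cx, cy):
    """Back-project a registered RGB-D pair into a coloured point cloud.

    rgb:   (H, W, 3) colour image, already aligned to the depth sensor
    depth: (H, W) depth in metres (0 where the sensor returned no reading)
    fx, fy, cx, cy: pinhole intrinsics of the depth camera
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                      # drop missing depth readings
    z = depth[valid]
    x = (u[valid] - cx) * z / fx           # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)  # (N, 3) metric coordinates
    colours = rgb[valid]                   # (N, 3) matching RGB values
    return points, colours

# Illustrative intrinsics and a synthetic flat scene 1.5 m away.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.full((480, 640), 1.5)
pts, cols = rgbd_to_point_cloud(rgb, depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```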

Stereovision

The second 3D measurement technique is based on the principle of stereovision, which emulates the binocular vision of humans. A typical stereo camera uses two cameras that simultaneously capture images of the same object from slightly different angles. This is similar to how humans use both eyes to perceive depth. Stereovision systems have built-in algorithms that calculate the depth dimension from the observation that a point in the scene appears at a slightly different position in each image of the stereo pair; computing depth from this difference is called ‘triangulation’. The bigger the difference, the closer the object.
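As an illustration of the triangulation principle, here is a minimal sketch in Python/NumPy, assuming a calibrated, rectified stereo pair. For such a pair, depth Z relates to the pixel disparity d via Z = f · B / d, which is exactly the rule above: the bigger the difference, the closer the object. The focal length and baseline below are illustrative.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Z = f * B / d for a rectified stereo pair.
    focal_px: focal length in pixels; baseline_m: camera separation in metres."""
    depth = np.full_like(disparity, np.inf, dtype=float)
    valid = disparity > 0                  # zero disparity = point at infinity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative numbers: a 700 px focal length, 6 cm baseline and 35 px
# disparity put the point at 700 * 0.06 / 35 = 1.2 m from the camera.
print(disparity_to_depth(np.array([35.0]), focal_px=700.0, baseline_m=0.06))
```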

However, stereovision systems encounter specific challenges in accurately identifying corresponding points across the two images under certain conditions. This difficulty is pronounced in scenarios involving objects with minimal textural variation, such as uniformly coloured surfaces, or in complex environments with closely resembling elements, such as the leaves of plants in a greenhouse. These situations demand advanced algorithmic solutions to accurately resolve depth from the stereoscopic data.

Active stereo, AI-enhanced and AI-based stereovision 

To surmount the constraints inherent to traditional stereovision, advancements have led to three distinct approaches: active stereo with pattern projection, AI-enhanced stereovision and AI-based stereovision. Active stereo technology augments the scene with infrared dot patterns, casting artificial features onto the surface. This infusion of texture greatly aids in pinpointing corresponding points across the stereo images, facilitating more accurate depth calculations.

AI-enhanced stereovision capitalises on the power of 2D vision AI to preliminarily identify or predict pertinent features within the scene. These AI-identified features are then used to establish precise 3D coordinates on each image within the stereo pair, improving depth measurement accuracy.

This process of feature identification can be executed through supervised learning methods, where features of interest are manually labelled within the scene, or via unsupervised techniques that do not require explicit labelling. 

Alternatively, the entire triangulation process can be replaced by a deep learning model. The model is trained to directly estimate depth from the stereo images, bypassing traditional computation methods. This approach represents a significant leap forward, leveraging the inherent pattern recognition capabilities of deep learning to streamline and enhance the depth estimation process.
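A toy sketch of what such a model might look like, assuming PyTorch. The architecture is purely illustrative (real stereo networks are far larger); it only shows the input/output contract: a rectified stereo pair goes in, a dense disparity map comes out, with no explicit matching or triangulation step.

```python
import torch
import torch.nn as nn

class TinyStereoNet(nn.Module):
    """Illustrative end-to-end disparity regressor for a stereo pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),   # 6 = left RGB + right RGB
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),              # 1-channel disparity map
        )

    def forward(self, left, right):
        return self.net(torch.cat([left, right], dim=1))

model = TinyStereoNet()
left = torch.rand(1, 3, 240, 320)   # dummy rectified stereo pair
right = torch.rand(1, 3, 240, 320)
disparity = model(left, right)      # trained with e.g. an L1 loss vs ground truth
print(disparity.shape)              # torch.Size([1, 1, 240, 320])
```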

Structured light cameras

Structured light cameras use the same principle as stereovision, but one of the cameras is replaced by a projector. These systems generate a 3D model by casting a sequence of striped light patterns across the scene at very high speed. A dedicated camera records the distortion of these stripes as they contour the surfaces within the scene, enabling the derivation of depth information through the analysis of how the projected lines warp. The advent of exceedingly precise and powerful projectors has ushered in the capability to achieve depth measurements with sub-millimetre accuracy.
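As an illustration of the decoding step, here is a minimal sketch in Python/NumPy, assuming simple binary stripe patterns and a fixed brightness threshold (real systems often project Gray codes instead, to limit errors at stripe boundaries). The on/off sequence observed at each camera pixel identifies which projector column illuminated it; depth then follows by triangulating the camera ray against that projector column, which is omitted here.

```python
import numpy as np

def decode_binary_stripes(captures, threshold=128):
    """captures: list of n grayscale images, one per projected stripe pattern.
    Returns an integer code per pixel identifying the projector column band."""
    codes = np.zeros(captures[0].shape, dtype=np.int32)
    for img in captures:
        bit = (img > threshold).astype(np.int32)   # lit or unlit in this pattern
        codes = (codes << 1) | bit                 # accumulate bits, coarse to fine
    return codes

# With 10 patterns, pixels are resolved into 2**10 = 1024 projector columns.
captures = [np.random.randint(0, 256, (480, 640)) for _ in range(10)]
print(decode_binary_stripes(captures).max() <= 1023)
```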


However, this technique is not without its challenges. The integrity of the structured light patterns can be compromised by ambient lighting conditions, which may disrupt the pattern and interfere with the depth calculation process. Furthermore, the reliance on light reflection to gauge depth poses additional difficulties in accurately capturing surfaces with high reflectivity. To mitigate these issues, the integration of polarised light projection in conjunction with polarising cameras has been identified as a viable strategy, enhancing the system’s ability to discern depth on reflective surfaces by reducing the interference of unwanted light reflections.

Multi-view reconstruction techniques

The third category of 3D imaging technology involves multi-view reconstruction techniques, utilising the collective data captured by multiple cameras arrayed around the subject. These techniques employ sophisticated algorithms to compute a 3D model whose projections onto each camera (referred to as renderings) align with the actual images captured by these cameras.

Voxel carving

Among the most elementary algorithms employed in this domain is voxel carving, which is predominantly utilised for the three-dimensional representation of singular objects. This method projects each voxel (the three-dimensional counterpart of a pixel) into every 2D image and checks whether its projection falls within the object’s silhouette. Voxels whose projection falls outside the silhouette in any view are systematically eliminated, or ‘carved’ away. Iterating this process across the entire set of voxels in the designated volume culminates in a 3D reconstruction of the object. For the fidelity of the 3D model to be preserved, the spatial relationships among the cameras must be precisely established through meticulous multi-camera calibration. While this approach is particularly effective for modelling convex objects, small plants, or plants with a sparse structure, it encounters significant challenges when applied to complex shapes characterised by occlusions and concave surfaces.
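A minimal sketch of the carving loop in Python/NumPy, assuming precomputed silhouette masks and calibrated 3×4 projection matrices; all numbers in the usage example are illustrative.

```python
import numpy as np

def voxel_carve(voxel_centres, silhouettes, projections):
    """Keep voxels whose projection lands inside the object silhouette in
    every calibrated view; carve away the rest.

    voxel_centres: (N, 3) world coordinates of candidate voxels
    silhouettes:   list of (H, W) boolean masks (True = object)
    projections:   list of (3, 4) camera projection matrices, one per view
    """
    keep = np.ones(len(voxel_centres), dtype=bool)
    homog = np.hstack([voxel_centres, np.ones((len(voxel_centres), 1))])
    for mask, P in zip(silhouettes, projections):
        pix = homog @ P.T                         # project all voxels at once
        u = (pix[:, 0] / pix[:, 2]).round().astype(int)
        v = (pix[:, 1] / pix[:, 2]).round().astype(int)
        h, w = mask.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        on_object = np.zeros(len(voxel_centres), dtype=bool)
        on_object[inside] = mask[v[inside], u[inside]]
        keep &= on_object                         # carve if outside any silhouette
    return voxel_centres[keep]

# Minimal usage with one synthetic view: an all-object silhouette keeps every
# voxel that projects into the image; the third voxel projects off-image.
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
voxels = np.array([[0.0, 0.0, 1.0], [0.1, 0.1, 1.0], [5.0, 5.0, 1.0]])
mask = np.ones((64, 64), dtype=bool)
print(voxel_carve(voxels, [mask], [P]))
```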

Photogrammetry

Photogrammetry, or multi-view stereo, transforms multiple photographs into detailed 3D models in a four-step process. It begins by aligning photographs to identify common points and refine camera positions, creating a basic 3D framework. This framework is then densified with additional common points to add more detail. A mesh overlays these points to form a surface, and finally, textures and colours from the photos are applied, yielding a realistic 3D model. However, photogrammetry’s downsides include the need for numerous images and long computing times, making it less suitable for time-sensitive agronomic applications. Thus, it’s primarily used in non-urgent contexts, such as video game asset creation and digital preservation of cultural sites.
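To illustrate the first step (alignment), here is a sketch using OpenCV’s ORB features. The image paths, focal length and principal point are hypothetical placeholders, and a full photogrammetry stack such as COLMAP repeats this matching across many image pairs before solving for poses and a dense model.

```python
import cv2
import numpy as np

# Detect and match common points between two photographs of the same scene.
img1 = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical paths
img2 = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# The matched pixel pairs constrain the relative camera pose (via the
# essential matrix) and seed the sparse 3D framework described above.
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches[:500]])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches[:500]])
E, inliers = cv2.findEssentialMat(pts1, pts2, focal=1000.0, pp=(960, 540))
```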

Radiance field methods

Radiance field techniques are emerging as a promising area of development that could potentially address the limitations inherent in voxel carving for creating more accurate 3D models from 2D images. These methods aim to precisely model both colour and opacity within a three-dimensional space, adjusting this model through optimisation against a dataset of 2D images. The optimisation starts from a randomly generated semi-transparent volume, which is iteratively refined until it accurately represents all provided 2D views. While currently facing challenges such as the necessity for a large number of input views and lengthy computation times, rapid advancements in this field are anticipated to soon offer viable solutions.
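The heart of these methods is volume rendering along camera rays. Here is a minimal sketch in Python/NumPy of the alpha-compositing step; the sample colours, densities and spacings below are illustrative. Optimisation adjusts the densities and colours until rendered pixels match the training photographs.

```python
import numpy as np

def render_ray(colours, densities, deltas):
    """Composite colour/opacity samples along one camera ray into a pixel.
    colours: (N, 3) RGB at each sample; densities: (N,) non-negative opacity
    density; deltas: (N,) distance between consecutive samples."""
    alpha = 1.0 - np.exp(-densities * deltas)            # opacity per segment
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha                      # contribution per sample
    return (weights[:, None] * colours).sum(axis=0)

# A ray through empty space, then a red semi-opaque region:
cols = np.array([[0.0, 0, 0], [1.0, 0, 0], [1.0, 0, 0]])
dens = np.array([0.0, 5.0, 5.0])
print(render_ray(cols, dens, deltas=np.full(3, 0.1)))    # a mostly red pixel
```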

Choosing between 2D and 3D AI

The true magic happens when AI technology is applied to both the images and the synthesised 3D model, extracting critical data for decision-making or robotic operations. At this juncture, a pivotal decision arises between utilising 2D or 3D AI algorithms. Given that all 3D measurement techniques use 2D cameras, the captured images can be used for 2D vision AI analysis, with the findings subsequently projected onto the 3D scene. Alternatively, the 3D model itself can serve directly as input for 3D vision AI analysis, each approach bearing its distinct advantages and challenges.

Frequently, the resolution of 2D images surpasses that of the corresponding 3D models. When high resolution is essential for identifying specific features, such as early-stage plant disease markers, a 2D deep learning approach may be preferable. The rich ecosystem of pre-trained model weights for 2D analysis potentially reduces the volume of data needed. Additionally, AI-enhanced analysis of 2D imagery can contribute to the refinement of the 3D model, exemplified by AI-enhanced stereovision techniques.

Conversely, 3D AI models excel when the identification of features is predominantly governed by the object’s three-dimensional structure, such as locating branch junctions, estimating growth direction, or estimating fruit volume. It’s a common strategy to integrate both 2D and 3D AI models within a singular AI pipeline, harnessing the strengths of each to achieve a comprehensive analytical framework.
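A minimal sketch of such a combined pipeline in Python/NumPy: a 2D detection, whose bounding box is assumed to come from some upstream 2D model, is lifted into the 3D scene using an aligned depth map and illustrative intrinsics, giving the robot a grasp point in the camera frame.

```python
import numpy as np

def lift_detection(bbox, depth, fx, fy, cx, cy):
    """bbox: (u_min, v_min, u_max, v_max) in pixels from a 2D detector;
    returns the detection's 3D centre in the camera frame."""
    u0, v0, u1, v1 = bbox
    region = depth[v0:v1, u0:u1]
    z = np.median(region[region > 0])        # robust depth for the detection
    u_c, v_c = (u0 + u1) / 2, (v0 + v1) / 2
    return np.array([(u_c - cx) * z / fx, (v_c - cy) * z / fy, z])

depth = np.full((480, 640), 0.9)             # synthetic: everything at 0.9 m
target = lift_detection((300, 200, 340, 240), depth, 525.0, 525.0, 319.5, 239.5)
print(target)                                 # 3D grasp point in metres
```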

2D or 3D labelling

The selection between 2D and 3D AI algorithms isn’t strictly tied to the choice of annotation tools, as annotations can be projected back and forth between 2D views and the 3D model. This decision is predominantly influenced by the ease with which humans can label the data. Labelling within 3D environments might present complexities and navigation challenges, in contrast to the more straightforward and focused approach of annotating in 2D. However, efficient 3D labelling can significantly streamline the annotation process for 2D models by enabling the generation of multiple annotated 2D views from a single 3D annotation. It can be beneficial to switch between 2D and 3D views in your labelling tool, so you can use all available information to make the best annotations and verify your work in the other views or in the 3D model.
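As a sketch of that projection, assuming Python/NumPy and illustrative projection matrices from a multi-camera calibration, a single annotated 3D point becomes a 2D label in every view:

```python
import numpy as np

def project_label(point_3d, projections):
    """Project one annotated 3D point into each calibrated view."""
    p = np.append(point_3d, 1.0)                       # homogeneous coordinates
    labels_2d = []
    for P in projections:
        u, v, w = P @ p
        labels_2d.append((u / w, v / w))               # pixel position per view
    return labels_2d

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
# Two illustrative cameras: one at the origin, one shifted 0.2 m along x.
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
print(project_label(np.array([0.0, 0.0, 2.0]), [P0, P1]))
```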

Mastering the future of Agritech

Implementing advanced vision technologies in a high-tech agri environment, like automated greenhouses or research laboratories, demands a high level of expertise. The process involves selecting the optimal vision system tailored to specific needs, configuring camera placements for maximum efficiency, and devising strategies for seamless robot-camera collaboration. Furthermore, it’s crucial to identify the most appropriate AI models and annotation techniques to ensure success.

The evolution towards AI-powered automation in agriculture is not just a luxury but a necessity. In the end, it’s about addressing tangible problems in horticulture and agriculture. Thanks to Vision AI-powered automation, agricultural tasks can be handled with precision and efficiency, ensuring an innovative, sustainable and productive future.