A Comparative Analysis of Deep Learning Techniques for Crop Type Recognition in Temperate and Subtropical Regions from Multitemporal SAR Image Sequences
Authors
1Laura Elena, C.; 2Jose David, B.C.; 3Pedro, A.; 4Ieda Del'arco, S.; 5Gilson, C.; 6Patrick, H.; 7Raul, Q.F.
1PUC-RIO Email: lauracue@ele.puc-rio.br
2PUC-RIO Email: bermudez@ele.puc-rio.br
3PUC-RIO Email: pmad9589@ele.puc-rio.br
4INPE Email: ieda.sanches@inpe.br
5UERJ Email: gilson.costa@ime.uerj.br
6PUC-RIO Email: patrick@ele.puc-rio.br
7PUC-RIO Email: raul@ele.puc-rio.br
Abstract
Yield prediction, estimation of food production, and precise and accurate agricultural statistics are crucial for anticipating market behavior, creating new agricultural strategies and supporting economic planning by government and private agencies. Remote Sensing (RS) data have been increasingly applied to assess agricultural yield, production and crop condition. The approaches proposed for crop recognition fall into three main groups: pixel-wise, object-based and context-based. The first two are limited, respectively, by ignoring spatial and temporal context and by fully disregarding semantics. Context-based classification approaches take into account contextual information in the spatial and/or temporal domains. Deep Learning (DL) techniques have recently become very popular in the scientific community, particularly for image classification. Such techniques are able to learn features automatically from unlabeled samples. State-of-the-art land-cover and crop type classification techniques implement DL approaches that use spatial and temporal context. However, most research on crop type recognition has been conducted on datasets from temperate regions, where crop dynamics are simpler and there is usually a single crop per parcel during the whole season. In contrast, crop dynamics in tropical areas are more complex due to agricultural practices such as irrigation, no-tillage, crop rotation and multiple harvests per year. This work presents a comparative analysis between traditional and DL (supervised and unsupervised) approaches for crop classification on sequences of multitemporal SAR images, specifically from the Sentinel-1A satellite. These techniques have been tested in two areas with completely different crop dynamics: a tropical region, Campo Verde in Brazil, and a temperate region, Hannover in Germany. Both datasets are described below.
Campo Verde: The study area covers 4782 square kilometers of the Campo Verde municipality in the state of Mato Grosso, Brazil. The data consist of a sequence of 14 SAR images from October 2015 to July 2016. The main crops found in this area are Soybean, Maize and Cotton. There are also some minor crops such as Beans and Sorghum. Other classes present in the dataset are Pasture, Eucalyptus, Soil, Turfgrass and Cerrado; Millet, Brachiaria and Crotalaria were grouped as non-commercial crops (NCC). The number of crops per image changes along the image sequence due to the different phenological cycles of each crop.
Hannover: The study area covers 1728 square kilometers in the surroundings of the city of Hannover, in Northern Germany. The data consist of a sequence of 45 SAR images from October 2014 to September 2015. Crops in the area include summer Barley, winter Barley, Canola, Grassland, Maize, Potato, Rye and Sugar beet. These crops go through different phenological stages within a season.
Three approaches are compared: Conventional, Autoencoder (AE) and single-layer CNN. The Conventional approach consists of stacking all images of the multitemporal sequence to assemble, for each pixel location, a descriptor that comprises the features of all epochs. The representations built in this way are used to train a classifier that assigns a class label to each pixel in each epoch along the sequence. In the Autoencoder approach, temporal and spatial contextual information is exploited as part of the AE training: first, we stack the original features of all epochs, as in the Conventional approach, and then train a single AE for the whole sequence.
The procedure involves the following steps: (1) create an image tensor by stacking the images; (2) select, for each image in the sequence, M random patches of size w-by-w-by-d-by-n, where w is the kernel size, d the number of bands and n the length of the image sequence, so that each extracted patch yields a vector of dimension n × d × w × w; (3) standardize the set of M vectors to zero mean and unit variance; (4) train the AE on the standardized data; (5) apply the encoding function learned by the AE to the image tensor built in the first step, so that a single representation is extracted for each pixel for the whole sequence; (6) finally, train a classifier on the learned representations.
The single-layer CNN architecture consists of four layers: convolutional, max-pooling, fully connected and softmax. We train a CNN to describe a pixel location using information from its neighborhood, according to these steps: (1) select, for each pixel of each image in the sequence, a neighborhood of size w-by-w-by-d; (2) stack the corresponding patches extracted from all images of the sequence; (3) create training and testing sets of patches and train a CNN in a supervised fashion.
For all experiments the selected classification algorithm was Random Forest. The protocol adopted in our experiments is as follows: given a sequence of n images, we first stack the per-pixel feature representations of the whole sequence according to the selected method, and then train and evaluate the classifier or DL model on the stacked features, using the reference data for the last image in the sequence. For a given image sequence we classified only the image of the last epoch. We started with a single image in the sequence and repeated the experiment, successively adding earlier images.
For the Campo Verde dataset, the results confirmed that temporal information plays an important role.
The classification performance improved for most classes as the sequence length increased. In addition, DL techniques outperformed the conventional image stacking strategy in almost all experiments. The CNN achieved the best performance among all evaluated methods. Experiments on the Hannover dataset are being carried out in order to compare the performance of the methods in both regions.
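As an illustration of the data preparation described above, the following is a minimal Python sketch of image stacking, random patch sampling, standardization (steps 1 to 3 of the AE procedure) and the growing-sequence protocol. It is not the authors' implementation. It assumes NumPy, images stored as arrays of shape (H, W, d), patches that span all n epochs, and randomly generated data with d = 2 bands; the function names (stack_sequence, sample_patches, standardize, growing_sequences) and the eps parameter are hypothetical.

```python
import numpy as np


def stack_sequence(images):
    """Stack n images of shape (H, W, d) into a tensor of shape (n, H, W, d)."""
    return np.stack(images, axis=0)


def sample_patches(tensor, w, m, rng):
    """Draw m random w-by-w patches covering all n epochs and d bands.

    Each patch is flattened into a vector of length n * d * w * w.
    """
    n, height, width, d = tensor.shape
    rows = rng.integers(0, height - w + 1, size=m)
    cols = rng.integers(0, width - w + 1, size=m)
    patches = [
        tensor[:, r:r + w, c:c + w, :].reshape(-1)
        for r, c in zip(rows, cols)
    ]
    return np.stack(patches, axis=0)


def standardize(vectors, eps=1e-8):
    """Scale each feature to zero mean and unit variance.

    eps (an assumption of this sketch) avoids division by zero
    for constant features.
    """
    mean = vectors.mean(axis=0)
    std = vectors.std(axis=0)
    return (vectors - mean) / (std + eps), mean, std


def growing_sequences(n):
    """Epoch indices used in each run of the protocol.

    The first run uses only the last image; each later run adds
    one earlier image, and the last epoch is always the one classified.
    """
    return [list(range(n - k, n)) for k in range(1, n + 1)]


# Synthetic stand-in for a short SAR sequence: 4 epochs, 32 x 32 pixels, 2 bands.
rng = np.random.default_rng(0)
images = [rng.normal(size=(32, 32, 2)) for _ in range(4)]

tensor = stack_sequence(images)
patches = sample_patches(tensor, w=3, m=500, rng=rng)
standardized, mean, std = standardize(patches)

print(tensor.shape)          # (n, H, W, d)
print(patches.shape)         # (M, n * d * w * w)
print(growing_sequences(3))  # epoch indices for each run
```

The standardized vectors would then be the input to AE training (step 4), which is outside the scope of this sketch. The mean and standard deviation are returned so that the same scaling can be reused when the learned encoder is applied to the full image tensor.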
Keywords
Multitemporal SAR Image; Crop Recognition; Deep Learning