Introduction
Hyperspectral data can be difficult to sift through when a sensor collects hundreds of individual bands. To manage this, a technique called "binning," in which neighboring wavelength bands are combined into a single calculated band, is used to simplify the data and improve researchers' ability to evaluate wavelengths. The result still contains numerous individual wavelengths, but with slightly larger gaps between them, making differences easier to see. Binning also reduces the size of the data, which lowers storage requirements and transfer times. These smaller files are easier to process and analyze, allowing work to be done effectively on non-high-end computers. The question that remains is what effects binning has on the data, and whether its benefits are worth the trade-offs.
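The binning idea described above can be sketched in a few lines of code: groups of neighboring bands are averaged into single calculated bands. This is a minimal illustration, not the exact algorithm any particular sensor or software uses; real implementations may weight or handle edge bands differently.

```python
import numpy as np

def bin_spectrum(values, factor):
    """Average consecutive groups of `factor` bands into single bands.

    A minimal sketch of spectral binning; trailing bands that do not
    fill a complete group are dropped.
    """
    n = (len(values) // factor) * factor          # usable length
    grouped = np.asarray(values[:n], dtype=float).reshape(-1, factor)
    return grouped.mean(axis=1)

# Example: six bands binned by 2 become three bands
spectrum = [100.0, 102.0, 98.0, 96.0, 110.0, 112.0]
print(bin_spectrum(spectrum, 2))  # averages of each neighboring pair
```

Note how the output has half as many bands as the input, which is exactly the file-size reduction discussed above.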
Dataset Information
Because hyperspectral imagery (HSI) is used in a vast variety of applications, this paper focuses on the specific use case of Unmanned Aerial Systems (UAS) remote sensing. The data we will use to assess the effects of binning comes from Headwall's Nano HP (Table A) integrated into a multi-sensor package. The data was collected uncompressed and then processed using GRYFN's Processing Tool (GPT). GPT uses data from an Applanix-15 on board the sensor package to geolocate the imagery, uses LiDAR captured from the same package to create a surface model, and corrects the HSI using specially calibrated reflectance targets placed in the field before mosaicking the HSI together. This workflow matters because we will be looking at the mosaics created at the end of processing. GPT includes a setting for the value by which to bin the product, meaning the raw data was identical for each mosaic shown here and was not subject to any differing preprocessing that could alter the final result.
The dataset we will use to examine the effects of binning covers an area with a diverse set of features. Collected on September 26th, 2023, the imagery includes a late-stage mature agricultural field, grass, bare earth, asphalt, manufactured objects, and three specially calibrated reflectance panels with reflectivity values of 56%, 30%, and 11%, respectively. The nearest METAR weather station recorded the conditions at the time of collection as overcast, with ten statute miles of visibility and wind speeds of five knots. While overcast conditions are not ideal for HSI, since these sensors are highly light-sensitive and light-dependent, the cloud cover was consistent for the duration of the collection.
Dataset Evaluation
As mentioned, this HSI was processed using GPT to produce a mosaic of the area of interest. Using the same processing workflow, the dataset was processed four times: non-binned (NB), bin-by-2 (B2), bin-by-3 (B3), and bin-by-4 (B4). Again, the raw data was identical each time, with the only difference being the binning value set in GPT, giving us four mosaics to analyze. For the analysis I used QGIS, a free and open-source geographic information system (GIS). In QGIS we can inspect each mosaic's characteristics: the number of bands it contains, the spacing between bands, and its ground sample distance (GSD). This information is found by right-clicking the mosaic in the Layers panel and opening its properties. For example, we can see, as expected, that the NB mosaic has 340 individual bands with an average spacing of ~1.761 nm between bands. We can also see that the mosaic's GSD is 2 cm, meaning each pixel represents a 2 cm by 2 cm square of ground. This GSD holds for all four mosaics.
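Given the NB mosaic's 340 bands at ~1.761 nm spacing, simple arithmetic gives the approximate band counts and spacings we should expect in each binned mosaic (the exact values reported by GPT may differ slightly depending on how edge bands are handled):

```python
# Expected band count and spacing per binning factor, assuming the
# 340-band, ~1.761 nm NB mosaic described above.
nb_bands = 340
nb_spacing_nm = 1.761

for factor in (1, 2, 3, 4):
    bands = nb_bands // factor           # binning divides the band count
    spacing = nb_spacing_nm * factor     # and multiplies the spacing
    print(f"bin-by-{factor}: ~{bands} bands, ~{spacing:.3f} nm spacing")
```

This makes the trade-off concrete: B4 keeps only about a quarter of the bands, with roughly 7 nm between band centers.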
I created a shapefile to define locations to analyze in the HSI mosaics. By defining areas of more homogeneous features, we can better understand the impact of binning in specific ways. A similar tactic is used in broader supervised classification processes, where you identify sample feature sets and use them to classify other areas automatically.
I selected nine sample locations in the mosaics, used for each dataset and chosen to cover a variety of materials with distinct spectral signatures. While acknowledging the dataset's variety, you will notice some black lines and spots in the mosaic. These are areas of no data, likely caused by excessive movement of the UAS or by hardware limitations of the Nano HP sensor; specifically, writing data to the internal storage can slow down, so some data is never written. Sample areas were selected to avoid these gaps, since an ideal dataset would contain no missing data.
Individual Band Analysis
First, we will take a quick look at how binning affects individual bands. This is difficult because the band centers differ between binning values, so matching the exact same wavelength is not guaranteed, though not impossible. In the NB dataset, band 2 is 400.175 nm, and in the B3 dataset, band 1 is also 400.175 nm. So we will compare these two bands, restricting the comparison to sample location ID 3.
After extracting band 2 from the NB dataset and band 1 from the B3 dataset, an immediate difference appears: in QGIS, the maximum value for the whole band is 7,359 in the NB dataset but 6,923 in B3. As mentioned, we will narrow our analysis to the area of ID 3 for this initial investigation. For this I used QGIS's "Zonal Statistics" tool, which calculates user-specified statistics over a selected raster for areas defined by a zone file. I used my sample-location shapefile as the zone file and calculated the number of pixels (count), minimum value, maximum value, mean value, and variance of the data values. I also used a tool called Zonal Histogram to graph the distribution of band values.
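The statistics the Zonal Statistics tool produces are straightforward to express in code. The sketch below is not the QGIS implementation, just an illustration of the same computation: mask a band's pixels to a sample polygon, then summarize them.

```python
import numpy as np

def zonal_stats(band, mask):
    """Compute the statistics used in this analysis (count, min, max,
    mean, variance) for the pixels of `band` inside `mask`.

    `band` is a 2-D array of pixel values and `mask` is a boolean array
    of the same shape marking the sample polygon. Names here are
    illustrative, not the QGIS API.
    """
    pixels = band[mask]
    return {
        "count": int(pixels.size),
        "min": float(pixels.min()),
        "max": float(pixels.max()),
        "mean": float(pixels.mean()),
        "variance": float(pixels.var()),
    }

# Toy example: a 2x3 band with a 4-pixel sample area
band = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
mask = np.array([[True, True, False],
                 [True, True, False]])
print(zonal_stats(band, mask))
```

In the real analysis, the mask comes from rasterizing the sample-location shapefile against each mosaic.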
After running the bands through the tool, we can see the NB data has a lower minimum and mean, as well as a higher maximum, giving it a fairly large variance. The B3 dataset has a narrower range between its minimum and maximum values, and its mean is higher than the NB dataset's, resulting in a smaller variance. We can also look at the histograms for each dataset in Image D. Both begin with a spike at low values, but from there the graphs differ considerably: the NB histogram takes a very spiky shape, while the B3 histogram has some spikes that settle into a hill shape. So while NB has gaps between data values, B3 has smoothed the values out considerably.
But is something else happening when binning through GPT? To check, I averaged the first two bands of the NB dataset, which yields the same wavelength as the first band of the B2 dataset. This is not a perfect reproduction of binning; however, it lets us check whether GPT's processing introduces differences beyond the binning itself. Immediately after calculating the new averaged NB band, with a resulting wavelength of 399.294 nm, we see the maximum values for the whole dataset are only 0.5 off from B2's.
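The check described above amounts to a pixel-wise average of two band rasters, then a comparison of summary values against the B2 band. The sketch below uses small synthetic arrays as stand-ins for the real rasters, purely to show the shape of the computation:

```python
import numpy as np

# Synthetic stand-ins for NB bands 1 and 2 (real values come from the mosaic)
nb_band1 = np.array([[7000.0, 6900.0],
                     [7100.0, 7050.0]])
nb_band2 = np.array([[7100.0, 6950.0],
                     [7150.0, 7050.0]])

# Pixel-wise average approximates what bin-by-2 would produce for this pair
approx_b2 = (nb_band1 + nb_band2) / 2.0
print(approx_b2.max())  # compare this against the max of B2's band 1
```

In the actual test, the hand-averaged band's whole-dataset maximum landed within 0.5 of B2's, which is what motivated the conclusion that GPT's binning is a plain combination of neighboring bands.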
Again, I ran Zonal Statistics and Zonal Histogram on this data for sample location ID 3. The results in Table E show that the values are very close, with minimum and maximum values only 0.5 apart. This suggests that GPT's binning process does nothing additional to the binned data beyond what an individual might do by averaging. However, the histograms in Image F show that the binning algorithm is not simply linear: while the histograms take on a similar shape, B2's histogram is much sharper than NB's. Neither histogram, however, takes on the hill-like shape we saw for B3 in Image D.
Full Dataset Evaluation
As addressed in the previous section, comparing individual bands is quite limiting. To address this, I calculated the Normalized Difference Vegetation Index (NDVI) for these datasets. NDVI is an index traditionally used in agricultural remote sensing, designed to distinguish areas of denser, healthier vegetation from other features in a dataset. It places every pixel on a scale from -1 to +1, where negative values generally correspond to water or man-made surfaces, and values between 0.6 and 1 indicate vegetation of varying density and greenness.
To do this you use the following equation: NDVI = (NIR – Red) / (NIR + Red). This still is not perfect, as the exact same wavelengths are not available in every mosaic, so I selected the band closest to 669.5 nm for red and 842 nm for near-infrared (NIR) in each dataset. This will still reveal important differences in the data, and it is more precise than using the likes of a multispectral camera, which senses a much larger swath of wavelengths in its red and NIR images, making it difficult to parse information about specific wavelengths, which is the purpose of hyperspectral sensors.
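The NDVI equation above translates directly into a raster calculation. The arrays below are synthetic reflectance values standing in for the extracted red and NIR bands; in the actual analysis these come from the bands nearest 669.5 nm and 842 nm in each mosaic.

```python
import numpy as np

# Synthetic red and NIR reflectance rasters (stand-ins for the real bands)
red = np.array([[0.08, 0.30],
                [0.05, 0.25]])
nir = np.array([[0.45, 0.32],
                [0.50, 0.20]])

# NDVI = (NIR - Red) / (NIR + Red), computed per pixel
ndvi = (nir - red) / (nir + red)
print(ndvi.round(3))
```

Dense vegetation (high NIR, low red, as in the first column) lands near +0.7 to +0.8, while the pixel where red exceeds NIR comes out negative, matching the interpretation described above.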
This time I calculated Zonal Statistics for all of the sample locations in my shapefile, mentioned at the start of this paper. As a refresher, see Image B and Table C if necessary. Below, Image G shows the statistics of each location in each HSI mosaic. Again, the statistics calculated were the number of pixels (count), minimum value, maximum value, mean value, and variance of data values.
Looking at the initial values, we can make some quick observations about the effects of binning. The minimum value at the sample locations in NB is -1 in six of the locations, while the B2 dataset has only a single minimum value of -1, and B3 and B4 have none. On the other hand, all of the datasets have maximum values of 1 at sample locations ID 3 and 4. These are vegetative areas and, as seen in the count column, have the most pixels to collect values from, so it is possible this is a bit of an outlier.
Looking closer at ID 0 and 1, the NB dataset has a much lower minimum value than the binned datasets, and its variance there is also much higher. At ID 2, the mean and variance of NB and B2 differ considerably, with the NB mean negative and the B2 mean positive. ID 3 and 4 have very similar mean values to each other, which is further evidence that the number of -1 minimum values is a bit of an outlier. ID 5 and 6 are two tiny dirt patches selected to try to capture pure bare earth. Here we again see a -1 minimum value in the NB dataset at ID 6, where the B2 minimum is -0.38440859: more evidence that non-binned data contains more outlier values. The mean values of ID 5 and 6 are also interestingly close together; as the smallest sample locations, this suggests the two locations are fairly consistent, though again with a slightly larger discrepancy in the NB dataset.
ID 7 and 8 show a large spread of values across datasets, which could be attributed to their man-made nature. However, ID 0, 1, and 2 are also man-made, yet they are specially designed for consistent reflectivity. There is likely an abundance of noise in the light reflected from these surfaces because of their material, so each level of binning corrects for many askew values. That currently seems to be the takeaway from this NDVI comparison: the NB dataset appears to contain more noise, which is heavily corrected by binning by 2. Subsequent binning can refine values even further, but we must remember that this moves away from the purpose of using a hyperspectral camera rather than something like a multispectral camera.
To round out our analysis of these NDVI datasets, Image H, Image I, and Image J below show the average difference in mean values and in variance for each dataset at each sample location. The average mean difference is shown as an actual mean value, while the average variance difference is shown as a percentage. For the mean values, remember that NDVI ranges from -1 to 1, so 0.001 is not a large difference, while 0.01 is equivalent to a 1% difference. The average difference between NB and B2 is -0.0355, or about a -3.55% difference in NDVI terms, while the largest difference is between NB and B3 at -0.0470, or -4.70%. As an average this is quite a noticeable difference, but as you can note in Image I, ID 8 is an outlier. I do not believe it is fair to remove it from the average, since we are not testing for a specific use case, but it is worth acknowledging.
Shifting focus to the average difference in variance between datasets, we see a difference of 51.02% between NB and B2. The largest average variance difference, unlike with the mean differences, is between NB and B4 at 71.55%. This again indicates that binning removes a great deal of noise from the data and reduces the spread of values by a high percentage, with binning by 2 being an acceptable value to bin by. Unlike the mean differences, no single sample location stands out as a clear outlier across all datasets, which improves confidence in our results.
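One plausible way to express a variance difference as a percentage, as in the figures above, is the relative change from the NB variance. The exact formula used for the figures is not specified here, so this is an assumption, and the numbers below are illustrative rather than taken from the actual datasets:

```python
def variance_diff_pct(var_nb, var_binned):
    """Percent reduction in variance relative to the non-binned dataset.

    Assumed formula for illustration; the paper's figures may be
    computed differently.
    """
    return (var_nb - var_binned) / var_nb * 100.0

# Illustrative values only: a binned variance roughly half of NB's
print(variance_diff_pct(0.0200, 0.0098))
```

On this reading, a 51% average difference means binning by 2 roughly halved the variance at the sample locations.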
Conclusion
Based on the information collected over the course of this test, there are two primary takeaways. The first is that binning reduces the noise found in HSI: it does a great job of removing outlier values at the edges of the data range, which helps later analysis and calculations with the HSI suffer less from outliers that are likely not significant. The second is that, as seen in some of the individual band analysis, you need to be wary of binning smoothing out the data values in your HSI too much. While Image D is our only direct observation of this, care should be taken when binning your HSI. The NDVI imagery analysis backs this up, as B3 and B4 do not always follow a linear path the way NB and B2 do. For these reasons, binning by 2 appears preferable to working with non-binned HSI data, providing an appropriate amount of binning without worry about loss of data accuracy.