Issues with common assumptions about the camera pipeline, and their impact in HDR imaging from multiple exposures

Multiple-exposure approaches for HDR image generation share a set of building assumptions: that color channels are independent, and that the camera response function (CRF) remains constant while changing the exposure. The first contribution of this paper is to highlight how these assumptions, which were correct for film photography, do not hold in general for digital cameras. As a consequence, the results of multi-exposure HDR methods are less accurate, and when tone-mapped they often present problems like hue shifts and color artifacts. The second contribution is to propose a method to stabilize the CRF while coupling all color channels, which can be applied to both static and dynamic scenes and yields artifact-free results that are more accurate than those obtained with state-of-the-art methods according to several image metrics.

1. Introduction. The dynamic range of light intensities in a natural scene, defined as the ratio between the highest and the lowest luminance values, may easily span five orders of magnitude or more. While in most common situations the light coming from a scene is of high dynamic range (HDR), the vast majority of sensors (and displays) are of low dynamic range (LDR). The net result is that standard cameras are only able to capture different intervals of the luminance range at different exposure times; in particular, bright areas are properly captured using short exposure times, while dark areas are better captured using longer exposure times.
To overcome this limitation, Mann and Picard, in their seminal work [24], introduced the idea of creating an HDR picture of a static scene by combining a set of LDR images taken with different exposure times, proposing a parametric method to estimate the camera response function (CRF) that transforms the linear data into nonlinear form. This was soon followed by other very influential approaches for the problem that differ in the way the constant CRF is estimated, e.g., Debevec and Malik [5]; Mitsunaga and Nayar [28]; and Tsin, Ramesh, and Kanade [35]. Later works tackled more general cases, like dynamic scenes with camera and/or object motion [23,14,15,10,13,29,18] or video [34,16,11], and it can be said that multiexposure HDR creation is an ongoing research topic [27,4,21,31].
Multiple-exposure approaches that use nonlinear input pictures assume the following image formation model:

J(p) = f(E(p)\Delta t), (1.1)

where \Delta t is the exposure time, p is a pixel location, E(p) is the scene radiance value at p, f is a nonlinear transform usually denoted as the CRF, and J(p) is the resulting LDR image value, corresponding to one color channel. Analogous expressions hold for each of the three color channels, for which the function f might be different. In a static scene the values E(p) remain constant, so taking a stack of N pictures by varying the exposure times gives us for each image

J_i(p) = f(E(p)\Delta t_i), (1.2)

where the subindex i denotes the different exposures and it is also assumed that the function f remains constant as \Delta t_i changes. Multiple-exposure methods estimate the inverse g of the CRF f, g \equiv f^{-1}, apply it to the image values J_i(p), and then divide by the exposure time \Delta t_i so as to obtain one estimate of E(p) for each image i in the stack:

\hat{E}_i(p) = g(J_i(p)) / \Delta t_i. (1.3)

These N estimates of E(p) are then averaged in order to provide the final output, the HDR value for pixel p.
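The estimate-and-average procedure above can be sketched in a few lines. This is a toy example with a known power-law CRF and a uniform average over the estimates; the function names and the simulated data are illustrative, not the implementation of any cited method:

```python
import numpy as np

def merge_exposures(images, exposure_times, g):
    """Merge a stack of nonlinear LDR images (one channel) into an HDR
    estimate: apply the inverse CRF g = f^{-1}, divide by the exposure
    time to get one estimate of E per image, then average the estimates
    (uniformly here, for simplicity)."""
    estimates = [g(J) / dt for J, dt in zip(images, exposure_times)]
    return np.mean(estimates, axis=0)

# Toy data: CRF f(x) = x**0.5, so its inverse is g(y) = y**2.
E = np.array([0.1, 0.2, 0.4])                 # "true" scene radiance
times = [0.5, 1.0, 2.0]
stack = [(E * dt) ** 0.5 for dt in times]     # simulated captures, no clipping
hdr = merge_exposures(stack, times, g=lambda y: y ** 2)
```

In this noise-free, unclipped setting every image yields the exact radiance, so the average recovers E; with real captures, saturation and the black pedestal make the per-image estimates disagree, which is why weighting functions are used.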
We can see then how most multiple-exposure approaches share a set of building assumptions for the camera capture: 1. Different color channels are independent. 2. The camera response remains constant while changing the exposure.
The contributions of this paper are, first, to highlight how these assumptions do not hold in general for digital cameras, so multiexposure HDR methods based on them often produce HDR pictures that when tone-mapped show hue shifts, color artifacts, or contrast problems. As a second contribution we propose a method to improve multiple-exposure combination, compensating for the violations of assumptions (1) and (2) above and allowing us to obtain more precise HDR images from nonlinear LDR inputs, both for static and for dynamic scenes, that when tone-mapped show no signs of spatiotemporal artifacts of any kind.
2. The response function of digital cameras. The assumptions that different color channels are independent and that the camera response remains constant while changing the exposure are correct in the case of film photography, but they are not an accurate model of how digital cameras work. Digital cameras follow a typical camera color processing pipeline [3] that can be expressed as

[R, G, B]^t_{out} = (A [R, G, B]^t_{in})^{\gamma}, (2.1)

where [R, G, B]^t_{in} is the sensor triplet (usually in 12 or 14 bits); [R, G, B]^t_{out} is the pixel value at the end of the pipeline (in 8 bits per channel); A is a 3 \times 3 matrix that combines the different color channels, taking into account white balance, color encoding, color characterization, and a gain value; and \gamma is a value, typically between 1/1.8 and 1/3, performing gamma correction (notice that we omit demosaicking, denoising, compression, etc.; for a complete explanation of these pipeline processes, see [2]).
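A minimal sketch of this pipeline model, with a hypothetical non-diagonal matrix A and a hypothetical gamma value (the numbers are made up for illustration and are not calibrated camera parameters):

```python
import numpy as np

def camera_pipeline(rgb_in, A, gamma):
    """Simplified model of (2.1): the 3x3 matrix A couples the channels
    (white balance, characterization, gain), then gamma correction is
    applied. Demosaicking, denoising, etc. are omitted."""
    linear = A @ rgb_in                      # channel mixing: A is NOT diagonal
    return np.clip(linear, 0.0, 1.0) ** gamma  # nonlinear 8-bit-style encoding

A = np.array([[ 1.6, -0.4, -0.2],            # hypothetical, non-diagonal A
              [-0.3,  1.5, -0.2],
              [-0.1, -0.4,  1.5]])
rgb_sensor = np.array([0.2, 0.3, 0.1])       # normalized sensor triplet
rgb_out = camera_pipeline(rgb_sensor, A, gamma=1 / 2.2)
```

Because A mixes the channels, the output of each channel depends on all three inputs, which is exactly the coupling the paper argues is ignored by channel-independent CRF models.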
We claim that, contrary to what has been assumed in the multiexposure literature:
1. the three channels R, G, B are not independent, because the matrix A is not diagonal, as it incorporates color processing steps, like color characterization, that involve all channels;
2. the CRF changes from one picture to the next in the multiexposure scenario: the camera automatically modifies the \gamma value and the A matrix, implying that the nonlinear transform f in (1.2) is not constant, and therefore that there is no single CRF to speak of, because the camera response changes from image to image.

The next example illustrates the latter point. Figure 1(a) reproduces a few individual exposures (top) used by M. Fairchild in [7] to create the LuxoDoubleChecker HDR image (bottom). The images were captured in RAW format alongside their nonlinearly corrected counterparts. For a stack of N RAW pictures R_i, the image formation model is

R_i(p) = E(p)\Delta t_i, (2.2)

and this equation is valid for the range of luminances for which the sensor operates in the linear range, above the black pedestal and below saturation. This is why, when creating an HDR image through multiple-exposure combination, professional users prefer to take RAW pictures: in this way, there is no need to estimate and invert the CRF that is applied to the nonlinearly modified pictures stored in 8-bit-per-channel form. Applying the logarithm to both sides of (2.2) and leaving only the exposure term on the right, we get

log(R_i(p)) - log(\Delta t_i) = log(E(p)); (2.3)

therefore, if we plot log(R_i / \Delta t_i) versus log(E), we should get a single line of slope one. This is indeed approximately the case, as we can see in Figure 1(b). In principle, the same could be said in the nonlinear case when applying the logarithm to (1.3):

log(g(J_i(p))) - log(\Delta t_i) = log(E(p)), (2.4)

so if we plot log(g(J_i) / \Delta t_i) versus log(E), we should also get a single line of slope one. In practice, though, this does not always happen, as Figure 1(c) shows.
The fact that the values of log(g(J_i) / \Delta t_i) are quite spread out implies that it was wrong to assume that f (as well as its inverse g) was constant; the conclusion is that the camera must have modified the values of some of its parameters, A and \gamma, when the exposure time \Delta t_i changed.
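The slope-one behavior can be checked numerically; a minimal sketch on synthetic data from an ideal linear sensor (not the actual camera measurements of Figure 1):

```python
import numpy as np

# For an ideal linear sensor, R_i = E * dt_i within its working range,
# so log(R_i / dt_i) plotted against log(E) must be a line of slope one,
# regardless of the exposure time.
E = np.linspace(0.05, 0.5, 20)            # synthetic scene radiances
slopes = []
for dt in (0.25, 1.0, 4.0):               # three exposure times
    R = E * dt                            # simulated RAW values
    slope, _ = np.polyfit(np.log(E), np.log(R / dt), 1)
    slopes.append(slope)
```

On real JPEG data the analogous fit with g(J_i) deviates from slope one across exposures, which is the spread the text refers to.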
To support this claim and to highlight how widespread this camera behavior is, we have performed tests on multiple-exposure sequences coming from four different camera models, where during capture only the exposure time changed, and with results recorded both in linear (RAW) and in nonlinear (JPEG) form. Having the same picture in these two versions allows us to estimate the values of \gamma and the matrix A with (2.1), using the RAW data for the [R, G, B]^t_{in} values and the JPEG data for the [R, G, B]^t_{out} values. In Figure 2, columns 1--4 correspond to different sequences (shown in rows 1--4 of Figure 3) taken with different cameras, while the fifth column corresponds to the average over the 105-image HDR Survey database [7]. The first row of Figure 2 plots, for each sequence, a value that measures how far the matrix A of each image in the sequence departs from being diagonal: we compute the average of the absolute values of the nondiagonal elements of A, normalized by its maximum value. The fact that these values are consistently above 0.1 shows that the three channels R, G, B are not independent. The second row of Figure 2 plots the value of 1/\gamma for each image in the sequence, ordered from shortest to longest exposure time. We can see that, for all sequences, as the exposure time increases the value of 1/\gamma also increases, and the change is quite substantial. The third row of Figure 2 plots the difference between the matrix A of each image in the sequence and the matrix A of the middle image in the stack. Again we see that the cameras change A from one exposure to the next.
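The diagonal-departure measure used in the first row of Figure 2 can be sketched as follows; reading "normalized by its maximum value" as the maximum absolute element of A is our assumption:

```python
import numpy as np

def diagonal_departure(A):
    """Average absolute value of the off-diagonal elements of A,
    normalized by the maximum absolute element of A. Zero for a
    diagonal matrix; values consistently above 0.1 indicate that the
    camera couples the R, G, B channels."""
    A = np.asarray(A, dtype=float)
    off_diagonal = A[~np.eye(3, dtype=bool)]     # the 6 non-diagonal entries
    return np.mean(np.abs(off_diagonal)) / np.max(np.abs(A))

d = diagonal_departure([[1.0, 0.2, 0.1],
                        [0.1, 1.0, 0.2],
                        [0.2, 0.1, 1.0]])        # example non-diagonal matrix
```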
Figure 2. The last column corresponds to the average over the 105-image HDR Survey database [7]. The x-axes correspond to the image index i in the exposure sequence, and the y-axes to the corresponding error. The first row shows the distance error between the color correction matrix A_i and a diagonal one, the second row plots the estimated 1/\gamma_i for each image, and the last row presents the difference error between the A_ref matrix of the middle exposure and the rest of the matrices A_i. Note that the A and \gamma values correspond to the model in (2.1).

In Figure 3, each row corresponds to a camera model from a different camera maker. The columns show tone-mapped results of the HDR pictures obtained with different multiexposure combination methods: from the RAW pictures (first column), from the JPEG pictures
using the multiple-exposure combination methods of Debevec and Malik [5] (column 2), of Mitsunaga and Nayar [28] (column 3), and of Lee et al. [18] (column 4), and from the method proposed in this paper (last column). We can see that the previous multiple-exposure methods that take the nonlinear JPEG inputs produce results with visible problems, like hue shifts and color artifacts; to underline that these artifacts are not due to the particular tone-mapping method used, each row employs a different, state-of-the-art tone-mapping algorithm.
In the next section we will introduce a method that, considering all three channels simultaneously, removes the fluctuations in \gamma and A, effectively making the CRF constant for the whole sequence. Its results for the above sequences are shown in the last column of Figure 3.
3. Proposed method to make the CRF constant. A schematic of our method is presented in Figure 4. The input is a set of N nonlinear LDR images acquired under different exposure times, of which we select P as reference images. Our algorithm consists of two steps.

3.1. Step 1. Inspired by the digital camera color processing pipeline in (2.1) and the color stabilization model proposed in [36], we consider the color-matching process as the key to our method, because it is the one that removes the fluctuations in \gamma and A and, in practice, makes the CRF constant for all images in the multiexposure sequence. The idea of the color stabilization model [36] is to obtain a 3 \times 3 matrix H_src and two gamma correction values \gamma_ref, \gamma_src such that a source image I_src can be color-corrected to match the colors of a given reference I_ref:

I_ref^{\gamma_ref} \simeq H_src I_src^{\gamma_src}. (3.1)

Given the reference image I_ref, for each other image I_i in the sequence we do the following. We compute a set of correspondences pts_ref and pts_i; we use SIFT [20], although it can be exchanged for any other matching method.
Then, from the set of correspondences, we build a system of equations,

H_i pts_i^{\gamma_i} = pts_ref^{\gamma_ref}

(one triplet equation per correspondence), which we solve for \{\gamma_i, \gamma_ref, H_i\}. At this point, we want to mention that we have considered different sizes for the matrices H_i and H_ref, in particular 3 \times 3 and 4 \times 4. In section 4.1 we explain the specific implementation details for each case, and in section 4.4 we present the results and discussion. Finally, the matrix and nonlinearity \{\gamma_i, H_i\} are applied to the entire image I_i as in (3.1), and we obtain the linear color-matched image:

I'_i = H_i I_i^{\gamma_i}. (3.4)

In Figure 5 we show an example of this procedure. After we have linearized and color-corrected all the images in the sequence, obtaining I'_i, i = 1, ..., N, we produce an intermediate HDR result HDR_ref by performing a weighted average with a trapezoidal weighting function \omega_T, in the range [0, 1], that discards extreme pixel values:

HDR_ref(p) = \sum_{i=1}^{N} \omega_T(I_i(p)) I'_i(p) / \Delta t_i \Big/ \sum_{i=1}^{N} \omega_T(I_i(p)).
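A simplified sketch of the fitting in Step 1: assuming the two gamma values are known, estimating H over the correspondences reduces to a linear least-squares problem (the actual method estimates the gammas jointly with H, 11 unknowns per image pair):

```python
import numpy as np

def estimate_h(pts_src, pts_ref, g_src, g_ref):
    """Given corresponding RGB triplets (one row per correspondence) and
    the two gamma values, solve H @ pts_src**g_src = pts_ref**g_ref in
    the least-squares sense. A simplification of Step 1: here the gammas
    are assumed known rather than estimated jointly with H."""
    X = pts_src ** g_src               # linearized source colors, N x 3
    Y = pts_ref ** g_ref               # linearized reference colors, N x 3
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)   # X @ B ~ Y
    return B.T                         # so that y_row ~ H @ x_row

# Synthetic check: correspondences generated with a known H and gammas.
rng = np.random.default_rng(0)
H_true = np.array([[1.2, 0.1, 0.0],
                   [0.0, 0.9, 0.1],
                   [0.1, 0.0, 1.1]])
src = rng.uniform(0.05, 1.0, (50, 3))
ref = (src ** 2.2) @ H_true.T          # here g_src = 2.2, g_ref = 1
H_est = estimate_h(src, ref, 2.2, 1.0)
```

With noise-free correspondences the least-squares solve recovers H exactly; in practice, robust fitting over SIFT matches is needed because of mismatches and clipped pixels.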

3.2. Step 2. The intermediate HDR results present differences, as Figure 6 shows: in the top row, the leftmost image correctly captures the bright color checker while missing the details of the dark color checker, and the reverse situation occurs with the middle and rightmost intermediate HDR results. Thus, we combine the different \{HDR_ref_i\}, i = 1, ..., P, images to produce the final HDR output. We first scale each of them so that they are all in the same range, since they have been computed from different reference images captured with different exposures. To do this, for each HDR_ref_i we compute the trimean, defined as (Q_1 + 2Q_2 + Q_3)/4, where Q_1, Q_2, Q_3 are the quartiles; we choose the trimean in order to avoid outliers while taking into account the distribution of the image data. Once all trimeans are obtained, we scale the values of each HDR_ref_i so that the resulting image has the same trimean as that of a selected HDR_ref_sel, which in our case is the middle-exposure reference (image 5 in a 9-image sequence). Finally, we sum the scaled set \{HDR_ref_i\}, i = 1, ..., P, to obtain the final HDR image, as shown in Figure 4. By fusing them, we achieve a final HDR image with more details in both bright and dark areas.
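The trimean-based scaling and fusion of Step 2 can be sketched as follows (the helper names are ours):

```python
import numpy as np

def trimean(x):
    """Tukey's trimean (Q1 + 2*Q2 + Q3) / 4: robust to outliers while
    still reflecting the distribution of the image data."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (q1 + 2 * q2 + q3) / 4

def fuse_intermediates(hdr_stack, sel):
    """Scale each intermediate HDR image so its trimean matches that of
    the selected reference hdr_stack[sel], then sum the scaled images
    into the final HDR output."""
    t_ref = trimean(hdr_stack[sel])
    scaled = [h * (t_ref / trimean(h)) for h in hdr_stack]
    return np.sum(scaled, axis=0)

# Toy check: two intermediates that differ only by a global factor.
h1 = np.array([1.0, 2.0, 3.0, 4.0])
final = fuse_intermediates([h1, 2.0 * h1], sel=0)
```

After scaling, the second image coincides with the first, so the fusion is simply twice the reference; with real intermediates, each one contributes detail from a different part of the range.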

4. Results and comparisons.
4.1. Experiments. As explained in section 3.1, we linearize the stack of images with respect to a reference one, as expressed in (3.4), by estimating a nonlinearity (a power-law function with exponent \gamma) and a linear transformation (a matrix H). In this section we conduct the following experiments regarding the dimension of the matrix H: (1) estimate H as a 3 \times 3 matrix, and (2) estimate H as a projective 4 \times 4 matrix.
In the case of a 3 \times 3 matrix H, we compute three intermediate HDR images to obtain the final one. We select the middle exposure together with the two around it: HDR_ref_{4,5,6}. This experiment is our proposed method, Proposed, and its quantitative and qualitative results are presented below in section 4.4. In the optimization process of Step 1, we estimate 11 unknowns for each image pair (2 \gamma values and the 9 elements of the matrix).
The choice of a 4 \times 4 matrix allows the model to be more flexible and to deal with the pixels that lie close to the border of the color gamut. In this case we also compute three intermediate HDR images, but this time we select the middle exposure and the first and last exposures: HDR_ref_{1,5,9}. We call this experiment Projective. In the optimization process of Step 1, we estimate 17 unknowns for each image pair (2 \gamma values and 15 elements of the matrix). In addition, we use homogeneous coordinates for the matrix multiplication, appending a 1 to the [R, G, B]^t color vector; afterwards, we divide the resulting vector by its last element and keep only its first three coordinates.
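The homogeneous-coordinate application of the 4 \times 4 matrix reads, in a minimal sketch:

```python
import numpy as np

def apply_projective(H4, rgb):
    """Apply a 4x4 projective color transform: append 1 to the RGB
    vector, multiply, then divide by the last coordinate and keep the
    first three, as described for the Projective variant."""
    v = H4 @ np.append(rgb, 1.0)
    return v[:3] / v[3]

# A pure-offset example: the fourth column acts as the offset w, which
# shifts all colors by a constant (the behavior discussed in 4.4.1).
H4 = np.eye(4)
H4[:3, 3] = 0.1
shifted = apply_projective(H4, np.array([0.2, 0.3, 0.4]))
```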
The way we estimate the linearized images is slightly different from the 3 \times 3 case. In order to compute the new linear sequence, we first estimate the set of transformations for each consecutive pair of exposures: for I_1 and I_2 the set \{H_1, \gamma_1, \gamma_2\}; for I_2 and I_3 the set \{H_2, \gamma_2, \gamma_3\}; for I_3 and I_4 the set \{H_3, \gamma_3, \gamma_4\}; and so on. The linearization from an image I_i to the given reference is then the composition of all the intermediate transformations.
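Assuming, purely for illustration, that the pairwise transforms are chained in the already-linearized domain (so that the intermediate gammas cancel out), the composition to the reference reduces to matrix products:

```python
import numpy as np

def compose_to_reference(Hs, i, ref):
    """Compose pairwise 4x4 transforms, where Hs[k] maps exposure k+1's
    linearized colors into exposure k's frame (0-based indices), to map
    image i into the reference frame. Illustrative simplification: the
    gammas are assumed already applied, so only matrices are chained."""
    M = np.eye(4)
    if i > ref:
        for k in range(ref, i):          # Hs[ref] @ ... @ Hs[i-1]
            M = M @ Hs[k]
    else:
        for k in range(i, ref):          # inverses in the other direction
            M = np.linalg.inv(Hs[k]) @ M
    return M

# Three pairwise transforms, near identity so they are invertible.
rng = np.random.default_rng(1)
Hs = [np.eye(4) + 0.1 * rng.standard_normal((4, 4)) for _ in range(3)]
M_20 = compose_to_reference(Hs, i=2, ref=0)
```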

4.2. Database.
We performed our experiments on the HDR Survey data set by Mark Fairchild [7]. This online public-domain database contains 105 different scenes acquired with a Nikon D2x DSLR camera. The data consist of corresponding JPEG and RAW images for different exposures. In each scene, the images in the sequence are numbered from shortest to longest exposure time. All the scenes except two are composed of 9 images; the other two have, respectively, and 18 images. For the experiments and evaluation we downscale all images by a factor of 4, so the image size is 1072 \times 712.

4.3. Ground-truth generation.
Let us consider N RAW images acquired with different exposure times \Delta t_i. From the header of the RAW file we read the following parameters: (1) the dark and saturation values, which are the minimum and maximum values that the camera produces; (2) a 3 \times 1 array containing the white balance values for each channel; and (3) the CFA Bayer pattern, e.g., ``rggb''. The ground-truth (GT) construction proceeds in two stages. The first one is the merging step, where the N RAW images are combined to obtain a RAW HDR image HDR_L:

HDR_L(p) = \sum_i \omega_i R_i(p) / \Delta t_i \Big/ \sum_i \omega_i,

where \omega_i is a weighting function.
We have chosen as weighting function \omega_i = \Delta t_i^2 (the squared exposure time) for its simplicity and good performance, as shown in [30].
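The merging stage with the squared-exposure-time weights can be sketched on noise-free toy data (in which case every per-image estimate already equals the scene radiance):

```python
import numpy as np

def merge_raw(raws, times):
    """Merging step of the GT construction: a weighted average of the
    per-image linear estimates R_i / dt_i, with weights w_i = dt_i**2
    (the weighting chosen in the text, following [30])."""
    w = np.asarray(times, dtype=float) ** 2
    estimates = np.array([R / dt for R, dt in zip(raws, times)])
    return np.tensordot(w, estimates, axes=1) / w.sum()

# Noise-free check: each RAW image is exactly E * dt_i, so every
# estimate equals E and the weighted average recovers E.
E = np.array([0.1, 0.3, 0.7])
times = [0.5, 1.0, 2.0]
raws = [E * dt for dt in times]
hdr_l = merge_raw(raws, times)
```

The quadratic weights favor longer exposures, whose linear estimates carry less relative noise in dark regions; saturated and under-pedestal pixels would additionally need to be masked out on real data.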
The second stage of GT creation converts the obtained HDR_L into a color image by applying a number of steps. First, HDR_L is linearly scaled to the range [0, 1]. Next, white balance is applied. Then, the image is demosaicked using the method proposed by Zhang and Wu [38]. Finally, the color transformation is performed, in which each pixel triplet [R, G, B]^t is multiplied by the 3 \times 3 matrix M_color = E \cdot C, where E is the matrix converting XYZ values into sRGB values and C is the sensor characterization matrix that transforms RGB sensor values into standard XYZ values, the one described in [7] for the camera used to acquire the data.
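A sketch of this second stage, with demosaicking omitted for brevity and identity matrices used only to make the toy example checkable (the real E and C matrices are camera- and standard-specific):

```python
import numpy as np

def raw_to_color(hdr_l, wb_gains, C, E_mat):
    """Second stage of GT creation (demosaicking omitted): scale to
    [0, 1], apply per-channel white balance, then multiply every pixel
    by M_color = E_mat @ C (E_mat: XYZ -> sRGB, C: sensor RGB -> XYZ)."""
    img = hdr_l / hdr_l.max()          # linear scaling to [0, 1]
    img = img * wb_gains               # per-channel white balance gains
    M_color = E_mat @ C                # combined 3x3 color transform
    return img @ M_color.T             # applied per pixel (H x W x 3)

# With unit gains and identity matrices the stage reduces to scaling.
hdr_l = np.array([[[0.2, 0.4, 0.8]]])
out = raw_to_color(hdr_l, np.ones(3), np.eye(3), np.eye(3))
```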
4.4. Results and discussion. Starting with the quantitative, objective evaluation, we compare the HDR outputs of each algorithm (with JPEG sequences as input) against the computed GT using six standard metrics suggested for this purpose by Hanhart et al. [12]: for luminance, the peak signal-to-noise ratio (PSNR), the structural similarity metric (SSIM) [37], and the HDR quality metric HDR-VDP-2 [26]; for color, the color version of PSNR (CPSNR), the color extension of SSIM called CID [19], and the color difference measure CIEDE2000 (\Delta E^\ast_{00}) [33]. Finally, we also compute the \ell_2-norm in RGB space of the difference between a given method and the GT. The results, averaged over the data set, are presented in Table 1. We can see that our method outperforms the others according to all metrics, except for the color metrics CID and \Delta E^\ast_{00}, where Projective is superior, and HDR-VDP-2, where DM [5] performs better; our algorithm comes second in both cases. Let us note that for HDR-VDP-2 the average is done over the 42 images for which photometer readings exist for the minimum and maximum absolute luminances of the scene, as these values are required by the metric.
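For reference, the PSNR computation underlying two of these metrics is the standard formula (this is not code from [12]):

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio, in dB, between an estimate x and the
    ground truth y. CPSNR applies the same formula with the MSE taken
    jointly over the three color channels."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

err = psnr(np.zeros(10), np.full(10, 0.1))   # MSE = 0.01
```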
To show that the errors due to fluctuations in camera parameters can result in very visible artifacts, Figure 7 compares, from top to bottom, the outputs of DM [5], Lee13 [18], Sen [32], and our approach for the scenes RITTiger, HancockKitchenInside, TheNarrows2, and MasonLake1, from left to right. All results have been tone-mapped with the method in [25]. For the scene RITTiger we can see that DM [5] (first row) presents a red cast; Sen [32] (third row) shows the same color cast, since it uses DM as input. In TheNarrows2, Lee13 [18] (second row) shows very noticeable color issues. Finally, for the MasonLake1 scene, the method of Lee13 presents a blue cast, DM presents a reddish cast, and the method of Sen shows a banding artifact in the sky region.
To highlight that the visual problems described before are not due to a particular choice of tone-mapping algorithm, Figure 8 shows the same HDR results but tone-mapped with two different methods: [25] for the first two columns and [6] for the last two. The scenes are AirBellowsGap (columns 1 and 3) and LabWindow (columns 2 and 4), while the multiple-exposure HDR methods compared are, from top to bottom: MN [28], Lee14 [17], Hu [13], and our approach. We can see how the previously existing methods produce color artifacts in the sky and sun of the AirBellowsGap scene and in the curtains, sky, and background of the LabWindow scene, which are apparent for both of the tone-mapping methods used.
Figure 7. From left to right: scenes RITTiger, HancockKitchenInside, TheNarrows2, and MasonLake1 from the HDR Survey. From top to bottom: results from DM [5], Lee13 [18], Sen [32], and our approach. All images tone-mapped with [25].

4.4.1. Experimental discussion: 3 \times 3 versus 4 \times 4 matrix H_i. Next, we point out some interesting outcomes of the comparison between the two experiments. Although the 4 \times 4 results were expected to outperform the 3 \times 3 ones, the quantitative results suggest otherwise (see Table 1). The explanation lies in the matrix definition. Let us write the projective transformation as

H' = \begin{pmatrix} A_{3\times 3} & w \\ v^t & 1 \end{pmatrix},

where (A_{3\times 3} | w) is an affine transformation and v^t represents the transformation of the ``line at infinity''. Notice that the information about the exposure time is carried by the matrix A_{3\times 3}. In the context of HDR acquisition, in images with a large difference in exposure time, the offset w might affect dark/bright areas containing relevant data by making them darker/brighter
during the optimization process. This fact has a huge impact on the recovery of the final HDR. As an example, let us consider a well-exposed image (as the reference) and an underexposed image (which captures information on bright scene areas). In this case, the estimated 4 \times 4 matrix might brighten all the underexposed regions by selecting a very high value of the offset parameter and not by selecting high values in the first three elements of the diagonal. When this happens, the high values of the w offset clearly compromise the dynamic range of the final HDR by reducing it.
Nevertheless, the Projective choice shows better performance in terms of color. As an example, Figure 9 shows the linearization results of a set of images for the 3 \times 3 and 4 \times 4 matrices. From the 507 scene we select two images, I_5 and I_8 (first row), and set image 5 as the reference. This example shows that the projective transformation (third row) qualitatively outperforms the 3 \times 3 case (second row) in terms of color. Let us focus on the frontal part of the car: on the one hand, the color-matched image I_8 using the 3 \times 3 matrix (second row, third column) presents a brighter red; on the other hand, the projective transformation recovers the darker red color of the car (third row, third column). The same occurs with the blue digits 507: in the 3 \times 3 case (second row, last column), the blue color appears brighter than in the reference image, whereas in the projective case the blue is recovered. This explains why Projective outperforms Proposed in terms of the color metrics (CID and \Delta E^\ast_{00}).

Figure 8. First and third columns: AirBellowsGap scene. Second and fourth columns: LabWindow scene. Columns 1 and 2: results tone-mapped with [25]. Columns 3 and 4: results tone-mapped with [6]. From top to bottom: results from MN [28], Lee14 [17], Hu [13], and our approach.

4.4.2. Dynamic scenes. It is worth emphasizing that the proposed algorithm does not require image registration, only a set of pixel correspondences; therefore, it can also be used on dynamic scenes. In particular, Step 1 of our method can be employed as a preprocessing step to color-stabilize the inputs of HDR methods operating on dynamic scenes, enhancing their performance. To illustrate this, we consider the algorithm of Sen et al. [32], which receives linearized images as input.
Figure 9. Top row: the LDR images 5 (reference) and 8 from the 507 scene [7]. Middle row: the color-matched results considering a 3 \times 3 matrix (H). Bottom row: the color-matched results considering a 4 \times 4 matrix (H'). The last two columns present the ROI shown as cyan rectangles in each corresponding row. Notice the red color on the front of the car and the blue numbers 507. The 4 \times 4 results are closer to the reference than the 3 \times 3 ones.

Consequently, given a stack of nonlinear images, we compare three linearizing approaches: (1) the CRF computed by Debevec and Malik [5]; (2) the radiometric calibration by Lee et al. [18]; and (3) Step 1 of our method, using as reference the image at the midpoint of the sequence and finding pixel correspondences with SIFT [20]. We conducted this experiment on a stack of five images from the data set presented in [32], the Skater sequence (which comes in JPEG format). In Figure 10 we present the HDR outputs obtained using the three different linearization approaches.

Figure 10. HDR results on a dynamic scene applying the HDR creation method of Sen et al. [32] with three different linearization techniques: Debevec and Malik [5] (left), Lee et al. [18] (middle), and Step 1 of our proposed method (right), taking as reference the image at the midpoint of the sequence. HDR results tone-mapped with Mantiuk, Daly, and Kerofsky [25].

The zoomed-in details allow us to see how linearization by [5] (left) produces artifacts in overexposed areas,
whereas linearization with [18] produces results that, although free from artifacts, have lower contrast and less saturated colors than what can be obtained with our method.

5. Conclusions.
Our experiments show that the camera response function changes with the exposure and depends on the three color channels simultaneously. For this reason, multiexposure HDR approaches based on estimating and inverting a CRF that is supposed to be constant may have substantially more error than if computed directly from the linear data, and when tone-mapped, they commonly show hue shifts, color artifacts, or contrast problems. In this work we have proposed a method for removing the fluctuations in the internal settings that the camera has automatically modified, so that our approach effectively makes the CRF constant for the whole sequence. It can be applied to both static and dynamic scenes. Our results are more accurate than those obtained with state-of-the-art methods and show no visual problems.