
AI-enabled wearable cameras for assisting dietary assessment in African populations


npj Digital Medicine, volume 7, Article number: 356 (2024)


We have developed a population-level method for dietary assessment using low-cost wearable cameras. Our approach, EgoDiet, employs an egocentric vision-based pipeline to learn portion sizes, addressing the shortcomings of traditional self-reported dietary methods. To evaluate the functionality of this method, field studies were conducted in London (Study A) and Ghana (Study B) among populations of Ghanaian and Kenyan origin. In Study A, EgoDiet’s estimations were contrasted with dietitians’ assessments, revealing a performance with a Mean Absolute Percentage Error (MAPE) of 31.9% for portion size estimation, compared to 40.1% for estimates made by dietitians. We further evaluated our approach in Study B, comparing its performance to the traditional 24-Hour Dietary Recall (24HR). Our approach demonstrated a MAPE of 28.0%, showing a reduction in error when contrasted with the 24HR, which exhibited a MAPE of 32.5%. This improvement highlights the potential of using passive camera technology to serve as an alternative to the traditional dietary assessment methods.

According to the 2016–2025 Nutrition Strategy of the World Health Organisation (WHO), the organisation collaborates with Member States to ensure universal access to healthy diets and effective interventions aimed at ending all forms of malnutrition. This goal is particularly important in African countries, which are undergoing a nutrition transition. Despite a significant proportion of the population still suffering from undernutrition in Africa, overweight is increasing in certain urban areas, along with the burden of diet-related diseases, contributing to the so-called double burden of malnutrition1. In order to plan policies to eradicate or reduce malnutrition, it is essential to have information on habitual food and nutrient intake at the individual and household levels.

A commonly used tool in nutritional epidemiology studies is the 24-Hour Dietary Recall (24HR), in which participants recall and report the food types and portion sizes to experienced dietitians. However, this traditional self-reporting method is labour-intensive, expensive, and inevitably leads to a biased and inaccurate dietary analysis due to its dependence on self-report. Moreover, the burdens associated with the requirement to self-report diet on multiple occasions can influence eating behaviour, study recruitment, and importantly, subsequent retention, thus biasing participant demographics. In recent years, efforts have been made to minimise reporting bias and make the process more efficient by using various dietary assessment tools, such as mobile phones to record dietary intake. However, these methods require access to mobile technology, computer literacy, and certain levels of human intervention, which may pose significant problems when deploying population-level dietary assessments in low- and middle-income countries (LMICs).

To address this issue, our research group proposed a passive pipeline, EgoDiet, using simple and low-cost wearable cameras to capture food-eating episodes continuously and automatically, which can minimise the burden and technical requirements for users in dietary information collection and analysis. Using low-cost wearable cameras, this approach can potentially address the challenges of deploying population-level dietary assessments in LMICs and improve the accuracy of dietary analysis. Apart from this, due to its purely passive nature, EgoDiet not only enables data collection and analysis with minimal human involvement, but it also has clinical applications. For instance, patients with chronic diseases such as diabetes, cardiovascular diseases, or kidney diseases require strict dietary monitoring. The system can provide healthcare providers with more precise data on patients’ food intake, allowing for more tailored and effective dietary recommendations and interventions. For users suffering from malnutrition, the system can help in closely monitoring their food intake to ensure they are receiving the required nutrients. This real-time monitoring can assist in adjusting dietary plans promptly based on the patient’s actual consumption. In broader public health initiatives, the passive camera technology also moves the understanding of food intake closer to the ground truth of the nutritional intake of family members, providing insight into how food is distributed within family groups. Additionally, it offers a more accurate picture of the extent of malnutrition and the specific nutrients that are deficient in particular areas. These findings have the potential to facilitate the implementation of effective nutrition actions and policies in close collaboration with local governments.

To accurately quantify an individual’s food intake, measuring the actual amount of food consumed (i.e. portion size) is critical for dietary assessment. With recent developments in computer vision techniques, different vision-based methods have been proposed to solve the problem of portion size estimation5. Specifically, the methods can be divided into several main categories, ranging from stereo-based6,7 and model-based8,9,10 to perspective transformation11,12,13 approaches. Although these methods show promising results in portion size estimation, they are mostly designed for active capture of food intake (e.g. using a mobile phone app to capture the food images14,15,16,17), and are thus difficult to implement in studies with passive capturing wearable cameras (where the wearable cameras continuously capture the food images automatically). For instance, the stereo-based method places several constraints on the capturing angles, e.g. at least two images must be taken from different viewing angles13 to achieve feature matching and 3D reconstruction, which makes this method difficult to use with wearable cameras worn by the users. Another concern is that the stereo-based approach relies heavily on feature matching between frames in order to obtain the 3D geometry of the food items. Under the low lighting conditions common in LMIC households, however, the food surface often lacks distinctive texture and characteristics, which makes stereo matching infeasible. The model-based approach calculates the food volume by matching the food items in the image with pre-built 3D food models. For instance, a research group proposed a virtual reality approach by superimposing 3D models with known volume onto the images containing the food items18. However, this method requires a certain amount of human intervention, such as rotating and scaling the 3D models to match the food items in the image, which makes it suitable only for active capture of food intake. Most importantly, the model-based approach only measures the initial portion size (i.e. the quantity of food at the start of eating). It does not provide an estimation of the consumed portion size (i.e. the total quantity of food eaten), contributing to large errors in nutrient intake assessments. Similar to the stereo-based approach, the perspective transformation approach also requires users to use hand-held cameras and take images from specific locations, e.g. placing the mobile phone on the table11 or capturing images from a bird’s-eye view19. Detailed information on various recent works based on the active approach can be found in Table 1. Compared to previous works, our proposed passive method has fewer limitations, making it more versatile and convenient in providing portion size estimation. Beyond offering convenience, passive systems record important dietary behaviours such as eating priority, personal food preferences, and meal timings. The sequence of food consumption reflects an individual’s prioritisation of different food items, a factor that can differ significantly due to cultural and personal variations. Food preference plays a crucial role in dietary assessment as it influences food choices and consumption patterns. By considering eating priority and food preferences, dietary assessments can be more accurate and personalised. Furthermore, the timing of meals holds considerable importance, influencing metabolic processes and overall health. Monitoring meal times aids in pinpointing irregular eating patterns or habits that could alter nutritional intake and affect one’s well-being. Therefore, we believe passive methods represent a new and innovative approach to dietary assessment (See Supplementary Table 1 for more recent works using passive methods).

The study areas were located in London (Study A) and Ghana (Study B). In Study A, our research group conducted a feasibility study to evaluate the functionality of two customised wearable cameras, namely the Automatic Ingestion Monitor (AIM)20 and the eButton21 (Fig. 1a), and the performance of EgoDiet in dietary analysis for sub-Saharan African populations living in London. The AIM is a gaze-aligned wide-angle lens camera which can be attached to the temple of eyeglasses (eye level), while the eButton is a chest-pin-like camera which can be worn using a needle clip (chest level). Captured images on both cameras are stored on SD cards, with the capacity to store ≤3 weeks of data. To conduct the feasibility study, real-life video footage of users eating food items of Ghanaian and Kenyan origin was captured to be used as a test dataset. Thirteen healthy subjects of Ghanaian or Kenyan origin, aged ≥18, were recruited to participate in the study held at the National Institute for Health Research Clinical Research Facility (CRF). Before each study session started, a standardised weighing scale (i.e. Salter Brecknell) was used to pre-weigh the food items. Pre-weighed food items were placed in food containers and presented to the subjects, and the wearable cameras were worn by the participants during eating. Instead of finishing all the food items, the subjects were asked to eat only until full. The food remnants were then post-weighed and recorded, so that the difference between the pre-weighed and post-weighed food items could be measured as the ground truth of the consumed food weight. (This method is known as Weighed Food Records (WFRs), which provide detailed quantitative information on an individual’s diet, such as consumed portion size. WFRs are considered a gold-standard method of dietary assessment and are often used to validate other dietary assessment methods, such as Food Frequency Questionnaires and 24HR.) The collected raw video footage was uploaded to the cloud server and fed into the trained AI models (EgoDiet) for predicting the consumed food categories and portion size. In Study B, our research group further conducted a study to evaluate the performance of our passive cameras in estimating the dietary intake of populations in Ghana under free-living conditions (Fig. 1c). A total of 10 households in Ghana were recruited, typically consisting of adults (mother and father) and adolescents. All participants provided written consent to participate in the study. Our field staff visited the targeted households in the morning and distributed the wearable cameras. The field staff stayed at each household only for a short time (around 30 min) to ensure the devices were properly worn. In the evening, our staff returned to the households after the final household meal was eaten to gather all the devices. Similar procedures were then followed: the anonymised and encrypted footage was uploaded to the cloud server for dietary assessment (Fig. 1b). In order to demonstrate the efficacy of EgoDiet against traditional methods of dietary assessment, experienced dietitians/nutritionists were also involved in our study to perform portion size estimation (i.e. to manually estimate the food consumption of the participants based on the captured camera images), and their estimation was then compared against the output of our passive method.
The team included three professionals from the USA, Ghana, and the UK, each with over 20 years of experience in dietary assessment methodologies. Additionally, two dietitians from Ghana also contributed to ensuring reliable portion size estimations. To ensure the fairness of the comparison, the assessors were trained to be familiar with foods of Ghanaian and Kenyan origin and with the technique of portion size estimation (Fig. 1d). Note that a detailed study protocol has been delineated in our prior publication1, which extensively describes the research objectives, the data collection process, and important considerations. Further discussions on the acceptability, functionality and validity of our system are elaborated in a subsequent analysis paper22. These publications collectively contribute to a comprehensive understanding of our system’s potential as an alternative to traditional dietary assessment methods.

a The prototype of the customised wearable cameras (AIM20 and eButton21 devices). The descriptions of the wearable cameras are shown. b The workflow of the passive monitoring system for dietary assessment. The wearable cameras were deployed, set up prior to the study and recollected after the study by our staff at the field site. The collected data was uploaded to the cloud server and fed into the trained AI models for recognising the food categories and estimating consumed portion size. The system can generate a report showing the consumed food items and their corresponding portion size for each individual/household in a particular region to assist dietitians in conducting nutrition surveys. c The field studies were conducted in both London (Study A) and Ghana (Study B) with populations of Ghanaian and Kenyan origin. d Passive monitoring enables the capturing of eating sequences, the identification of food items left unconsumed by the participants, and the recording of meal timing and duration, from which the habits and eating behaviours of the participants could be observed. In order to demonstrate the efficacy of EgoDiet compared to traditional methods in dietary assessment, we compared our method against two commonly used dietary assessment techniques. In Study A, we compared EgoDiet to manual visual-based portion size estimation carried out by experienced dietitians, while in Study B, we compared EgoDiet to 24HR. Weighed Food Records (WFRs) were used as the ground truth in both studies. The two boxed spots (in red) mark the two time frames used to calculate consumed portion size. Once we know the consumed portion size, we can calculate the nutritional information using regional or national nutritional databases.

Our pipeline performs segmentation (EgoDiet:SegNet) (Fig. 2a) followed by depth estimation and 3D reconstruction (EgoDiet:3DNet) (Fig. 2b), portion size-related features extraction (EgoDiet:Feature) (Fig. 2c) and, finally, portion size estimation (EgoDiet:PortionNet) (Fig. 2d) from the captured video frames. These four modules, trained/designed for different purposes, can be used individually or combined in different ways.
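As a rough illustration of how these modules fit together, the following sketch chains them in the order described above. The module interfaces shown here are hypothetical placeholders for illustration, not the actual EgoDiet code.

```python
# Minimal sketch of the EgoDiet pipeline (Fig. 2); module names and method
# signatures are illustrative placeholders, not the released implementation.

def estimate_portion_size(frame, seg_net, three_d_net, feature_extractor, portion_net):
    masks = seg_net.segment(frame)                        # EgoDiet:SegNet: food/container masks
    volume = three_d_net.container_volume(frame, masks)   # EgoDiet:3DNet: rough container scale
    features = feature_extractor.extract(masks, volume)   # EgoDiet:Feature: FT, FRR, VC, NP, PAR, AWR
    return portion_net.predict(features)                  # EgoDiet:PortionNet: weight in grams
```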

a Semantic segmentation module. It is implemented using the Mask R-CNN architecture, whose primary task is to segment African food items and food containers from passively captured images. Within this framework, 15 categories of African food items have been trained in the Mask R-CNN model. b 3D reconstruction module. It is a depth estimation network with an encoder-decoder architecture, with the objective of inferring depth maps from RGB images, thereby obtaining corresponding 3D models from which volume can be calculated. This volume is utilised as a rough estimation of scale and serves as a feature input for subsequent network processes. c Feature extraction module. This module extracts the following features from the pixels representing food and food containers within the segmentation masks: FT, FRR, PAR, NP, AWR. d Portion size estimation module. Utilising features extracted by the preceding network, this module trains a network capable of estimating food weight. An ensemble method comprising various machine learning techniques is employed. e Raw video footage processing through Mask R-CNN, a model structured similarly to the one depicted in the semantic segmentation module. The focus is to accurately identify food containers and evaluate the completeness of the captured containers. f The training procedures of EgoDiet. Our main network training can be divided into two parts: firstly, the semantic segmentation module, which is trained using annotated masks, and secondly, the portion size estimation module, which is trained using weight-annotated data.

Instead of simply using detection methods, which only indicate the location of the food items and contain little information about the portion size, this module generates the segmentation masks of food items and their containers. EgoDiet:SegNet is based on the Mask R-CNN architecture, which we optimised for pixel-wise segmentation and integrated into EgoDiet. We further applied transfer learning to the weights of the Mask R-CNN backbone pre-trained on the Microsoft Common Objects in Context (MS COCO) dataset with 328k images. This boosts the performance of African food item detection and segmentation. Afterwards, the module was trained using self-annotated frames from videos captured in the field site (See 'Methods') and a subset of the Diabetes60 dataset23 (Fig. 2f). The data collection methodology of the Diabetes60 dataset closely aligns with our passive monitoring approach, particularly in terms of the camera angles and distances employed during image capture. Furthermore, we applied image augmentation to increase the network's robustness to variations and therefore increase its generalisability. Note that although image segmentation is an important aspect of the research, this study primarily focuses on portion size estimation, and therefore its coverage is relatively limited. It is also worth noting that image segmentation is a relatively mature field, with numerous existing approaches and techniques available.
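For illustration, the sketch below shows one common way to adapt a COCO-pre-trained Mask R-CNN to new food and container categories. It uses the PyTorch/torchvision implementation rather than the Keras/TensorFlow implementation used in this work, and the class count (background plus 15 food categories and 2 container categories) is an assumption made for the example.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 1 + 15 + 2  # background + 15 African food categories + plate/bowl (assumed split)

# Start from COCO-pre-trained weights (transfer learning).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box classification head for the new class count.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Replace the mask prediction head as well.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, NUM_CLASSES)

# The model can then be fine-tuned on the annotated frames (plus the Diabetes60 subset)
# with a standard detection training loop, e.g. torch.optim.Adam(model.parameters(), lr=1e-4).
```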

We first evaluated the performance of EgoDiet:SegNet on raw video footage captured in Study A (London) (i.e. including images captured by both the eButton and the AIM). After raw video footage processing (See 'Methods'), the frames were divided into three categories (i.e. with container, incomplete container and without container) (Fig. 2e). Note that the following validation study was conducted using only the frames with containers. To investigate the generalisability of the model in food segmentation, we present the qualitative results of the semantic segmentation applied to wearable camera images. As shown in Supplementary Fig. 1, the network demonstrates great potential in dealing with several challenging scenarios. With the semantic masks of the food items, the food categories can also be deduced as a by-product. We evaluated both frame-level accuracy, wherein our model is applied to every frame to conduct food recognition, and scenario-level accuracy in food recognition, which averages the estimation of food categories for each scenario. Unlike food recognition using a single frame, averaging the results across the whole scenario could potentially improve the food recognition rate, showing the advantage of the passive method compared to most existing active approaches. Further improvements can also be achieved by annotating more images with food remnants (i.e. food items with small portion sizes) to train and refine the neural network.

To facilitate large-scale studies and minimise the impact on participants’ normal behaviour, a simple and low-cost wearable RGB camera is used to estimate portion size, rather than costly sensors with depth-sensing capability. This creates a new challenge due to scale ambiguity. To address this, we propose EgoDiet:3DNet, a neural network with an encoder-decoder architecture pre-trained on the NYU Depth V2 dataset (i.e. a dataset with video sequences from a variety of household scenes captured by RGB and depth cameras) and adapted with relatively little labelling using task-relevant images. Similar ideas have been explored in previous publications17,24. The goal of this module is to obtain a rough scale for food containers, which facilitates the subsequent estimation of food portion size. To achieve this, we re-project the inferred depth images to the world coordinate system to obtain the 3D point cloud of the containers. We then apply the convex hull algorithm to the 3D point cloud to form a mesh, from which we can compute the volume of the food containers. The main advantage of using EgoDiet:3DNet to estimate container volume over previous approaches is that it takes advantage of the coupled nature of depth and volume, and replaces volume annotation with depth annotation, which can be easily obtained using stereo/depth cameras. We consider volume estimation as a downstream task of depth estimation and break the volume estimation task into two separate parts (i.e. depth estimation and 3D reconstruction), which simplifies the training process without requiring paired RGB images with ground-truth volumes of food containers. In contrast, existing approaches such as using Convolutional Neural Networks (CNN) to regress container volume require large amounts of volume-annotated data to learn contextual information from the environment. Such an approach is very expensive and impractical for large-scale dietary assessment studies in the wild.
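A minimal sketch of the depth-to-volume step is given below: masked depth pixels are back-projected through a pinhole model, and the convex hull of the resulting point cloud gives the container volume. The intrinsic parameters and units are assumptions for illustration, not the exact values used in EgoDiet:3DNet.

```python
import numpy as np
from scipy.spatial import ConvexHull

def container_volume_cm3(depth_m, container_mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D and return the convex-hull volume in cm^3.

    depth_m        -- HxW depth map in metres (inferred by the depth network, or from a depth camera)
    container_mask -- HxW boolean mask of the food container (from the segmentation module)
    fx, fy, cx, cy -- pinhole camera intrinsics in pixels (assumed known/calibrated)
    """
    v, u = np.nonzero(container_mask)          # pixel coordinates inside the mask
    z = depth_m[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]     # drop invalid depth readings
    x = (u - cx) * z / fx                      # pinhole back-projection to world coordinates
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)       # N x 3 point cloud in metres
    hull = ConvexHull(points)                  # smallest convex set containing the points
    return hull.volume * 1e6                   # m^3 -> cm^3
```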

a This illustration diagram provides a representation of the EgoDiet:3DNet as applied to the captured RGB images within the EgoDIMCAP-Lab dataset. Depth images are first inferred and subsequently re-projected into world coordinates to generate a 3D point cloud. This is achieved by utilising the depth values and the camera’s intrinsic parameters. Following this, a convex hull algorithm is applied to the 3D point cloud to construct a mesh, from which the volume is calculated. The same procedure is also applied to depth images captured directly by an Intel RealSense D405 short-range Depth Camera (DC). b Bland-Altman analysis results. We compared the absolute errors in volumes calculated from depth images captured by the depth camera against the estimated volumes. The majority of data points fall within the Limits of Agreement (LOA), indicating a similar error magnitude between the two methods. c A t-test was also conducted in this study, and the absolute error is shown for each of the food containers (8 containers were evaluated) captured with various viewing angles. Note that the symbols * and ** denote significance levels of p < 0.05 and p < 0.01 respectively. The "ns" notation is used to indicate that there is not a significant difference between the means of the groups being compared. d Qualitative results of the EgoDiet:3DNet. The figure presents three sequential rows which show RGB images, inferred depth images, and 3D meshes of food containers respectively.

For the successful application of portion size estimation, it is necessary to ensure that the features learned by the models are meaningful for the respective downstream tasks. The feature extraction module extracts important features from the previous modules. As mentioned previously, it is ineffective to estimate portion size with a network trained end-to-end (i.e. using RGB images as input and the corresponding portion size as output) due to insufficient training data. For this reason, our research group developed an easy-to-implement and robust strategy, which is to build the portion size estimation model on top of the semantic segmentation network (EgoDiet:SegNet) and depth estimation network (EgoDiet:3DNet) using task-relevant handcrafted features. These features were selected for their direct relevance to the physical attributes of food portions that influence weight. With the use of domain knowledge to constrain the learning algorithm, less representative training data is required, since the estimation no longer relies on hidden implicit features (i.e. entirely bypassing the requirement for a large amount of training data for each food category when building a robust deep learning model). Another advantage is that the network generalises better and becomes more robust in handling unseen scenarios under free-living conditions, which facilitates portion size estimation in different environmental conditions. The designed features include Food Type (FT), Food Region Ratio (FRR), Volume of Container (VC), Plate Aspect Ratio (PAR), Pixel Number of the Plate (NP) and Area Weight Ratio (AWR). FT is represented through one-hot encoding to categorise each food item into one of 15 types; FRR calculates the proportion of each food item within the container by comparing pixel counts; VC provides a rough estimate of container size, especially useful when no reference objects or depth sensors are available; NP offers an approximation of the distance between the camera and the food containers; PAR indicates the container’s height-width ratio to reflect the camera’s viewing angle; AWR refers to the weight per unit area, which is pre-determined through regression analysis (See 'Methods' for a detailed explanation of each designed feature).

When developing a model to estimate food portion sizes, achieving accurate predictions of the food items’ weight is crucial. However, it’s equally important for the model to be interpretable. We need to comprehensively understand the factors leading the model to categorise the weight of food as high or low after analysing the images. Identifying the most significant features in determining weight estimations is a key part of this process. To facilitate this, we employ a technique known as Mean Decrease Accuracy (MDA), also referred to as permutation importance, on our self-collected dataset comprising 261 food instances. This dataset includes different types of food, various capturing angles and distances, and varying sizes of containers and food portions. This approach helps us determine the significance of each feature we have proposed and discern their impact on the final estimation. Permutation importance is particularly useful for evaluating non-linear estimators. It disrupts the association between the target and individual features by randomly permuting them across various subjects, thus allowing us to assess the effect of each feature on the model’s accuracy. Fig. 4a shows that all features have positive permutation importance, where a positive value corresponds to a deviation from the null hypothesis (i.e. the null hypothesis refers to no statistically significant relationship between the feature and the portion size). In our analysis, it becomes evident that the model for estimating portion sizes is particularly reliant on the Food Region Ratio (FRR), which has a mean importance score of 0.7 and accounts for 43.5% of the overall influence. Notably, the FRR is derived through our semantic segmentation module, showcasing the module’s integral role in calculating this feature. Although the other features offer relatively smaller contributions, their importance is still highlighted by the fact that they enhance the accuracy of portion size estimation when included, as evidenced by comparisons with models trained without these features (See Fig. 4b).
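The snippet below illustrates this analysis with scikit-learn's permutation_importance on synthetic stand-in data (the real 261-instance dataset and the fitted ensemble are not reproduced here); it is a sketch of the technique rather than the study's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["FRR", "VC", "NP", "PAR", "AWR"]        # FT one-hot block omitted for brevity
X = rng.random((261, len(feature_names)))                # placeholder features (261 instances)
y = 400 * X[:, 0] + 20 * rng.standard_normal(261)        # synthetic weights (g), dominated by FRR

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permute each feature in turn and measure the drop in held-out accuracy (MAE-based score).
result = permutation_importance(model, X_te, y_te, n_repeats=30,
                                scoring="neg_mean_absolute_error", random_state=0)
for name, importance in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")
```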

a Permutation importance. We found that the food portion size estimation relies heavily on the Food Region Ratio (FRR). The other features, including Food Type (FT), Volume of Container (VC), Pixel Number of the Plate (NP), Plate Aspect Ratio (PAR) and Area Weight Ratio (AWR), have relatively less importance but still have a positive effect on the estimation. b Evaluation was conducted using different portion size-related features and ML-based regressors, including Extra Trees (ET), Random Forest (RF), Gradient Boosting (GB) and the ensemble method. We report the accuracy and MAE of the models. Among all these models, the ensemble method with full features obtained the best performance, with higher accuracy and lower MAE compared to the rest of the models. Note that varying test data (e.g. capturing angles, container volumes, and the complexity of eating episodes) may lead to deviations in the results of different methods. Therefore, the ensemble method may not necessarily be the best-performing model in all scenarios. Nevertheless, our experiment clearly demonstrates the potential of the portion size-related features we extracted.

To verify the consistent performance of our proposed method in estimating portion sizes amidst natural body movements during meals, we assessed images captured from various viewing angles within identical scenarios. Our findings suggest that key features related to portion size, especially FRR, maintained consistency across different frames, which can be seen in the first and last rows of Supplementary Fig. 1a. This consistency indicates that our method is robust in handling video footage with reasonable body movements. We also observed a significant difference in FRR between the second image in scenario 2 and the others, due to the food container being poorly captured, leading to overestimation of the portion size. This highlights the importance of using data pre-processing techniques when monitoring nutritional intake through passive methods. Another interesting finding is that when we compare the values of PAR in the first row of images with those in the last row, the former are smaller. This is because the images in the first row were captured by a camera tilted downwards (i.e. similar to a top-down view). To further evaluate the usefulness of PAR, we performed an ablation study to classify whether an image was captured by the eButton or the AIM using PAR as the only feature (i.e. the images captured by the AIM were tilted downwards, while the images captured by the eButton were not). We trained an MLP classifier for this specific task, and the final testing accuracy reached 98.5%. This shows a promising preliminary result on the challenging problem of estimating the relative viewing angles of the cameras from images captured in the wild without relying on any deep learning-based approaches. These results unveil the potential of using the PAR feature (i.e. the aspect ratio of the food containers) to infer the viewing angles of wearable cameras, which in turn provides valuable insights into how to handle images captured from unseen positions.
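To make the PAR ablation concrete, the sketch below trains a single-feature MLP classifier to separate the two camera types; the PAR distributions are synthetic stand-ins, since the study's images are not reproduced here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical PAR values: a top-down (AIM-like) view yields a near-circular plate outline,
# whereas an oblique (eButton-like) view yields an elongated outline.
par_topdown = rng.normal(1.05, 0.05, 200)
par_oblique = rng.normal(1.60, 0.20, 200)
X = np.concatenate([par_topdown, par_oblique]).reshape(-1, 1)   # PAR is the only feature
y = np.array([0] * 200 + [1] * 200)                             # 0 = AIM-like, 1 = eButton-like

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```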

The portion size estimation module is a network implemented using a stacking ensemble Machine Learning (ML) method, combining a Multi-Layer Perceptron (MLP) neural network with Random Forest (RF) and Extra Trees (ET) regressors. The module takes the designed features as input and outputs the estimated portion size; the output of EgoDiet:PortionNet is in grams, since the ground-truth weight (g) of a food item is easier to obtain than its actual volume (cm3). Several other ML methods are also evaluated using the designed features to see whether these features perform well across different ML models.

After the feature extraction phase, EgoDiet:PortionNet is employed to estimate the portion sizes of food items using the extracted features. The portion size estimations are then compared to WFRs, which serve as the ground truth. Following prior works, k-fold cross-validation (k = 15) was first used to examine EgoDiet:PortionNet. Figure 4b shows the accuracy of models that incorporate Machine Learning (ML)-based regressors, each trained with a combination of the proposed features pertinent to portion size estimation. The incremental addition of features results in improved accuracy for most ML-based regressors, underscoring the positive contribution of each feature to the accuracy of the final portion size estimation. Furthermore, we calculate the MAE of the portion size estimations to assess performance. Distinct from previous studies that have utilised food volume error as the evaluation metric5,25, this work adopts food weight error, expressed in grams, as a more intuitive metric for dietitians and a potentially more relevant measure for dietary assessments. Our models, when equipped with the full features, exhibit enhanced performance across most ML-based regressors. Notably, the ensemble method emerges as the most proficient, achieving a MAE of 33.8 g. The performance of other ML models is shown in Table 3 for a detailed comparison. In Table 3, apart from feature-inclusion studies, ablation studies demonstrate the impact of excluding specific features. The absence of FRR, FT/AWR, and VC corresponds to the evaluation of the segmentation module, food identification, and the 3D reconstruction module, respectively. These studies aim to delineate the contribution of individual components to the overall performance of the system.
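As a sketch of this evaluation protocol, the snippet below runs 15-fold cross-validation and reports MAE (g) and MAPE (%) against the ground truth; the regressor and data are placeholders, assuming X holds the portion size-related features and y the WFR weights.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.random((261, 6))                                  # placeholder feature matrix
y = 300 * X[:, 0] + 50 + 10 * rng.standard_normal(261)    # placeholder WFR weights (g)

model = RandomForestRegressor(n_estimators=10, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=KFold(n_splits=15, shuffle=True, random_state=0))

print(f"MAE:  {mean_absolute_error(y, y_pred):.1f} g")
print(f"MAPE: {100 * mean_absolute_percentage_error(y, y_pred):.1f} %")
```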

a The comparison between the EgoDiet and visual estimation by experienced dietitians on consumed food portion size across different episodes. Note that for bone-in foods such as chicken, we do not consider the weight of the bones in the consumed portion size in this study. Besides, given that Mask R-CNN may miss some small food items, for convenience in calculating the portion size estimation error, we do not consider the undetected food items. b An illustration diagram showing different evaluation methods used in our studies. In Study A, we compared our EgoDiet method with a manual visual-based portion size estimation method, while in Study B, we compared the EgoDiet method with the 24HR method, as described in the following section.

EgoDiet system using both AIM and eButton shows lower MAPE compared to dietitians' estimates.

To further examine the practicality of EgoDiet under free-living conditions, the wearable cameras, including the AIM and eButton (Supplementary Fig. 3), were dispatched to the volunteer households in a rural region of Ghana. Video footage was captured by the passive wearable cameras in a free-living environment with minimal human intervention. In this study, we further trained our networks with augmented data and compared the performance of portion size estimation using EgoDiet with the 24HR, a commonly used self-reporting dietary assessment method in nutritional epidemiology studies (i.e. the food types and portion sizes are recalled by the participants and reported to experienced dietitians). WFRs were also used as the ground truth in this study (Fig. 5b). In Study B, each individual household prepared their own food at home; thus there were different varieties of food items among different households. To ensure a fair comparison, we only evaluated eating episodes that involved similar food categories to those in Study A (i.e. the seen/related food types). However, we also excluded some instances of consuming similar food types due to limitations caused by the orientation of the wearable camera, which may have impeded the capture of the high-quality images necessary for accurate dietary assessment. Finally, our system was evaluated on 9 eating episodes. Figure 7a shows the consumed portion size reported by 24HR and estimated by our proposed EgoDiet respectively. We found that EgoDiet achieves a MAPE of 28.0% (95% CI: 14.8–41.3%), while 24HR achieves a MAPE of 32.5% (95% CI: 24.9–40.0%) for portion size estimation (See Fig. 7b) (i.e. the MAPE is calculated using WFRs as the ground truth in this study). The results of Study B were found to be comparable to those of Study A. However, it should be noted that Study B was conducted in a more complex environment than Study A. The promising performance can likely be attributed to the fact that the food containers used in Study B typically contained only one type of food, which simplified the task. The outcome suggests that our proposed system is capable of achieving human-level performance in estimating food portion sizes, demonstrating its potential as an alternative to 24HR. This advancement could reduce the workload of dietitians by streamlining the dietary assessment process through the use of low-cost wearable cameras. Sample experimental results obtained through EgoDiet:3DNet, with details of the 3D reconstruction of food containers for different eating scenarios, are shown in the supplementary material (See Supplementary Fig. 4).

a The portion size reported/estimated by 24HR and EgoDiet. In the scatter dot plot presented, the mean and Standard Deviation (SD) are shown. It can be observed that the EgoDiet method and Weighed Food Records (WFRs) have similar means in portion size, indicating the potential usefulness of EgoDiet in assisting with dietary assessment. b Absolute error (%) on portion size estimation between EgoDiet and 24HR. Note that WFRs are used as the ground truth in this study.

In this study, we have proposed EgoDiet, a passive dietary monitoring system that uses AI and wearable technology to facilitate dietary assessment. We found that the system shows potential in reducing error rates in dietary assessment compared to traditional methods, suggesting it could be a promising alternative for monitoring dietary intake. Given the link between diet, non-communicable diseases, and associated health risks, there is a necessity for reliable methods to monitor dietary intake. The inaccuracies in reporting dietary intake with current self-reporting methods significantly undermine the integrity of various research outcomes, ranging from clinical trials to epidemiological studies evaluating the impact of nutritional policies. Moreover, the burdens associated with the requirement to self-report diet on multiple occasions can affect eating behaviour, study recruitment, and importantly, subsequent retention, thus biasing participant demographics. However, few methods are currently available that can accurately monitor dietary intake for individuals in their own environments.

With EgoDiet, we addressed several challenges in the field of dietary assessment such as self-reporting bias and participant burden. In terms of practicality, EgoDiet enables dietary assessment using low-cost passive wearable cameras, which minimise the burden and technical requirements for dietary information collection and analysis. Our empirical analysis comprises three phases, including an initial controlled environment testing phase, followed by Study A (conducted in London) and Study B (conducted in Ghana). The controlled environment testing phase allowed us to assess the system’s performance under optimal conditions, establishing the accuracy and consistency of EgoDiet’s core functionalities. Building upon these foundational results, in Study A, we found that the error rate when using the AI-powered EgoDiet approach was 31.9% for portion size estimation, whereas with traditional dietary assessment by dietitians using visual estimation, the error rate was higher, at 40.1%. In Study B, we compared EgoDiet with another traditional method, the 24HR. The results show that the error rate for EgoDiet was 28.0%, while the error rate for the 24HR method was slightly higher, at 32.5%. These comparative analyses show the potential of passive wearable technology in capturing dietary intake information. Notably, traditional dietary assessment techniques, predominantly reliant on self-reporting mechanisms such as 24HR and food frequency questionnaires, have been noted for their difficulties in achieving high levels of precision. Previous studies26,27 have highlighted the significant challenge of under-reporting, with rates reaching up to 59%, raising concerns about the reliability of these conventional methods. Besides, with our technology, dietitians are not required to conduct structured interviews to gather information on food types and portion sizes, which can reduce their workload, especially during large-scale dietary assessments. Furthermore, the cost of wearable cameras is comparable to conducting a single 24HR. As cameras are mass-produced, the cost can become even more favourable. Additionally, the reusability of cameras offers an additional cost advantage for our technology. The use of low-cost wearable cameras, which can be distributed and used by multiple participants simultaneously, also accelerates the speed of data collection.

From a technical perspective, to minimise the impact on participants’ normal behaviour when using our technology, we chose not to use any fiducial markers, which makes the task particularly challenging due to scale ambiguity. To address this issue, we proposed EgoDiet:3DNet, a model that can roughly estimate the scale of food containers. This demonstrates the potential of employing standard RGB cameras as a substitute for costly depth cameras, offering an alternative solution to portion size estimation. Despite the advancements in learning-based depth estimation methods, there are still constraints due to the requirement for training and testing within similar scenarios. As a result, these models are not sufficiently generalised, which can lead to less accurate estimations in significantly different scenes. Nonetheless, the continuous evolution of AI techniques is progressively addressing this challenge. Furthermore, EgoDiet:Feature and EgoDiet:PortionNet are engineered to operate together, enabling accurate portion size estimation with few annotations. With the use of domain knowledge to constrain the learning algorithm, our network can directly target the portion size-related features without needing to extract implicit features from large volumes of training data (i.e. paired RGB images and their corresponding portion sizes cannot be easily obtained). This multi-module approach to calculating portion size, while potentially leading to cumulative errors, also has its advantages. It allows for the timely replacement of earlier modules with state-of-the-art (SOTA) ones. For instance, with the rising popularity of large-scale foundation models, the accuracy of recognition (e.g. LLaVA28, GPT-4V), segmentation (e.g. Segment Anything29), and depth estimation (e.g. Depth Anything30) is continually improving, enabling us to optimise the performance of our network. Last but not least, it’s important to note that the majority of current algorithms are tailored for active food capture, which poses a challenge for integration with passively capturing wearable cameras. Techniques such as stereo-based methods, model-based approaches, and perspective transformation all come with their own limitations. These include restrictive capturing angles, dependency on distinct features on food items, and a need for user intervention. Such constraints significantly impede the scalability of dietary assessments.

However, there are also limitations and areas for improvement in our system, such as the privacy issues raised by passive monitoring, which require corresponding safeguards. Currently, our approach involves extracting and filtering images that contain food or food containers3. During this step, any images that do not meet these criteria, including those that may capture private activities (such as using the bathroom), are automatically identified and excluded by the AI. The second step involves blurring faces in the remaining images to ensure anonymity. This two-step approach ensures that sensitive information is filtered out early in the process, providing an additional layer of privacy protection. Although AI-based image extraction and blurring have already matured and effectively address privacy issues, there remains potential for optimisation, and this continues to be a key focus for the development of passive monitoring devices. Moreover, our passive monitoring system has yet to achieve the accuracy of active capturing methods in estimating portion sizes31,32. This is primarily due to challenges such as the inability to consistently capture food containers at the image centre because of natural body movements and hand occlusion, as well as difficulties in targeting the specific start and end time frames required for calculating consumed portion sizes. These aspects underscore opportunities for further improvement within our system.

For future work, our focus will be on enhancing the food recognition capabilities of the model to make it more general and capable of identifying a wider range of food categories. Currently, our model focuses on broader food categories rather than fine-grained types. In the future, we plan to extend the system to recognise more detailed and specific food items, allowing for a finer analysis of nutritional content. Moreover, the robustness and precision of portion size estimation warrant significant enhancement. The integration of large-scale foundation models presents a promising avenue for using contextual cues from the surrounding environment to deduce food portions more effectively. Another challenge in the field of dietary assessment is the difficulty associated with shared plates33, as well as distinguishing between the acts of consuming and preparing a meal through passive monitoring. To address these problems, we plan to incorporate learning-based action recognition methods to identify various daily activities, thereby augmenting the intelligence and accuracy of our system. Apart from software improvements, we are also focused on hardware enhancements. In response to factors like image quality and the capture angle of the devices, we are actively working on hardware upgrades, such as enhancing resolution and expanding the field of view, while ensuring that the capture angle remains consistent when participants wear the device. Beyond the technical scope of future work, the inherent characteristics of passive monitoring expand the range of data available for analysis in this domain. Passive monitoring enables the capturing of eating sequences, the identification of food items left unconsumed by the participants, and the recording of meal timing and duration, from which the habits and eating behaviours of the participants can be deduced. Such data could offer significant potential for future research in nutritional epidemiology. Overall, we believe that EgoDiet represents a valuable contribution to the field and has the potential to assist dietitians in their work to promote healthier diets.

The Imperial College Research Ethics Committee provided ethics approval for the studies; the approval reference number is 18IC4780 for the study at the CRF. The studies in Ghana were approved by the ethics committees (Institutional Review Boards) at both the University of Georgia, USA, and the Noguchi Memorial Institute for Medical Research, Ghana; the ethics approval references are STUDY00006121 and IRB00001276.

The wearable cameras were distributed to the participants, set up prior to the study and recollected after the study by our staff at the field sites. However, the urban-rural divide in LMICs could complicate the implementation of the system. In consideration of the low internet availability and inadequate electricity supply in rural regions, the captured images were temporarily stored on the on-board SD card, then collected by our staff and uploaded to the cloud server after returning to urban regions. After uploading, the images were forwarded to the trained networks for dietary assessment (i.e. the testing process). Note that all participants provided written consent to participate in the study.

Despite the convenience of the passive method, the implementation strategy is relatively complicated, since data storage for continuous long-term monitoring is a challenging technical issue: the passive method captures image frames continuously, which requires a large amount of on-device memory and an extensive amount of time for dietitians to manually scan through the images to estimate individual dietary intake. Several previous studies proposed selecting representative frames by circular plate detection using a Canny detector18,34. Recently, one work35 proposed using the Clarifai CNN, which outputs tags for each input image and determines the importance of an image by counting the food-related tags. Similar to that work35, we also proposed a deep learning method in our recent work3 to determine whether food items are present in the images. Our proposed method is designed to add captions to each captured image, thereby facilitating the identification process. If the images display captions with food tags, they are classified as food-related frames. We also proposed a classification method to further remove images in which the full food container is not visible (i.e. images with incomplete food containers). These removals are critical because our proposed portion size estimation network relies on the proportion between the food container and the food items on it. Conducting the estimation on images where only part of the food container is visible inflates the proportion of food items, which in turn causes overestimation of the food weight. In developing the method, we first generated the segmentation masks of the food containers for each image using Mask R-CNN. Then an alterable edge region was defined, whose parameters (e.g. the width of the edge region) could be changed according to the resolution of the images captured by the wearable cameras (e.g. 640 × 480, 1280 × 720). When a high proportion of the food container's segmentation mask overlaps with the edge region, the corresponding image is considered to contain an incomplete food container (Fig. 2e). Finally, privacy and security are of utmost importance in a passive monitoring system: before uploading any images, we apply AI-driven blurring to anonymise faces, and each image is encrypted to prevent access by non-research personnel.
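A minimal sketch of the incomplete-container check is shown below: it measures how much of the container mask falls inside a configurable border band of the image. The band width and overlap threshold are illustrative assumptions to be tuned per camera resolution.

```python
import numpy as np

def is_incomplete_container(container_mask, edge_width=20, overlap_threshold=0.05):
    """Flag frames whose container mask overlaps the image border region.

    container_mask    -- HxW boolean mask of a food container (e.g. from Mask R-CNN)
    edge_width        -- width of the border band in pixels (tune per image resolution)
    overlap_threshold -- fraction of the mask allowed to fall inside the border band
    """
    h, w = container_mask.shape
    edge_region = np.zeros((h, w), dtype=bool)
    edge_region[:edge_width, :] = True      # top band
    edge_region[-edge_width:, :] = True     # bottom band
    edge_region[:, :edge_width] = True      # left band
    edge_region[:, -edge_width:] = True     # right band

    overlap = np.logical_and(container_mask, edge_region).sum()
    mask_area = max(int(container_mask.sum()), 1)
    return overlap / mask_area > overlap_threshold
```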

To estimate the volume of a food container, we utilise a depth estimation network with encoder-decoder architecture3 to determine the depth of an RGB image featuring an empty container (i.e. the use of relevant regional scene images in transfer learning leads to a more substantial improvement in performance). We then project the estimated depth image into 3D space, creating a 3D point cloud. From this, we extract the 3D point cloud of the container and apply the 3D convex hull algorithm to calculate the container’s volume. The convex hull is the smallest possible convex set that contains all the points of a 3D model. For more information on volume calculations, please refer to our previous works5,36. The volume measured for the empty food container has the potential to provide scaling information, thereby aiding in the calculation of the actual portion size of the food consumed. Note that the selection of images for estimating the scale of food containers is not yet fully automated and involves certain human intervention. Selecting images without hand obstructions and containing empty containers improves the performance of 3D reconstruction and enhances the accuracy of scale determination. Additionally, including general objects of known or standard sizes, such as forks and spoons, in the selected images as references could aid in the calibration process.

The semantic segmentation module used in our study is based on the Mask R-CNN2, pre-trained on the COCO dataset37. The network architecture of Mask R-CNN is designed to perform both object detection and instance segmentation, leveraging various backbone architectures for feature extraction, bounding box recognition and mask estimation. The Mask R-CNN architecture can be divided into several parts, including the backbone architecture, Region Proposal Network (RPN), RoIAlign, object detection branch and mask generation branch. The backbone architecture employs ResNet-10138 in conjunction with the Feature Pyramid Network (FPN)39 for feature extraction. FPN constructs a multi-scale feature pyramid, leveraging a top-down architecture with lateral connections. This design enables the extraction of Region of Interest (RoI) features at various levels of the pyramid, adapting to their scale. The following Region Proposal Network (RPN) is then used to generate region proposals that are likely to contain objects. The object detection branch processes aligned RoI features to classify objects and refine their bounding boxes. It identifies the object’s class and adjusts the RoI to closely match the object’s location and size. Working alongside the object detection branch, the mask generation branch predicts a pixel-level mask for each detected object in an RoI. For further details on the implementation of the Mask R-CNN architecture (pre-trained on the COCO dataset) and the hyper-parameter settings used in this study, we refer readers to the work by Abdulla et al. on GitHub40. Due to the differences between the food categories in the COCO dataset and the African food items, the pre-trained model can only segment a limited number of food categories. This is not sufficient to develop a dietary assessment system for African populations. Therefore, 15 common African food items were selected for annotation: onions & tomato salad, tilapia fish, ugali, yam, chicken drumstick, spinach stew, avocado, banku, tomato soup, roasted beef, chapati, onions, rice dish, salted fish, and beef stew, using the VGG Image Annotator (VIA), following the procedures described in our previous publication33. Food containers such as plates and bowls were individually categorised and trained to aid both data preparation and portion size estimation. These annotated images were then combined with the Diabetes60 dataset23. Data augmentation was used to increase the diversity of the dataset for training the models, in which flipping, rotating, shearing and cropping were applied to the input images. Mask R-CNN can accommodate input images of varying dimensions, and resizing is not required before feeding them into the network. All the networks were trained using the Adam optimiser on an NVIDIA GeForce RTX 2080 graphics card, utilising Python with the Keras and TensorFlow frameworks. The best-trained model was then applied to the images after feature extraction. The loss function of the semantic segmentation module is shown in Equation (1):

$${L}_{total}={L}_{cls}+{L}_{box}+{L}_{mask}\qquad (1)$$

where Lcls, Lbox and Lmask refer to the losses for classification, localisation and the segmentation mask respectively. Lcls and Lbox are the same as in traditional object detection models. Lmask can be represented as the average binary cross-entropy loss, and its formula for food category k is described in Equation (2):

$${L}_{mask}^{k}=-\frac{1}{{d}^{2}}\sum _{1\le i,j\le d}\left[{y}_{ij}\log {\hat{y}}_{ij}^{k}+\left(1-{y}_{ij}\right)\log \left(1-{\hat{y}}_{ij}^{k}\right)\right]\qquad (2)$$

where yij is the ground-truth label of the pixel (i,j) in a region of size d × d and \({\hat{y}}_{ij}^{k}\) is the predicted label of the corresponding pixel learned for food category k. When the loss function Ltotal is minimised, the segmentation masks of food items and containers are generated. By doing so, the segmentation masks help extract features for each particular food item and facilitate consumed food weight estimation in the subsequent procedure. Since the focus of this paper is not on food segmentation, Mask R-CNN was not subjected to separate evaluation; the effectiveness of Mask R-CNN in food segmentation has been explored in our previous publications33,41.

The detailed information of the portion size-related features is summarised as follows:

Food Type (FT): One-hot encoding is applied, in which each categorical feature is encoded as a one-hot numeric array. It is represented by a vector of all zeros except for a single 1 in the position corresponding to the food category (a 15 × 1 vector in this case, as there are 15 food types).

Food Region Ratio (FRR): This feature refers to the ratio of the pixel count of each individual food item to that of the plate (NFi/NP), which gives an indication of the proportion of the region occupied by each food item in the container.

Volume of Container (VC): This feature provides a rough estimation of the container size. By applying EgoDiet:3DNet to the eating episodes, we can obtain the volume of the container. A more accurate estimation results from averaging the volumes derived from multiple images. Considering the reasonable inaccuracies in estimating container volume under free-living conditions, we round this output to the nearest 200 cm3. Although this is not highly precise at the moment, it can provide a general indication of whether the container is tiny, small, medium, or large (i.e. a categorical volume estimate). Additionally, if the food containers in both the training and test datasets have the same or similar dimensions, this feature can be disregarded to simplify the training process.

Pixel Number of the Plate (NP): This feature gives a rough estimate of the distance between the camera and the food container when the containers used in the training and test datasets are the same.

Plate Aspect Ratio (PAR): This represents the height-width ratio of the container, which gives an indication of the camera viewing angle. PAR can be derived from the coordinates of the plate's segmentation mask using Singular Value Decomposition (SVD), as shown in the following equation and in Supplementary Fig. 2. The idea of PAR is quite intuitive. Consider two scenarios: one in which an image is captured directly from above and another in which it is captured at an angle. Under EgoDiet's SegNet, these perspectives yield different segmentations, altering the proportion of the food mask within the container for the same eating episode. This leads to variations in the FRR, potentially introducing errors in portion size estimation. It is also important to note that variations in dish size do not affect the PAR, as this ratio is calculated from the length-to-width proportions of the plate. Assuming the dishes are captured from a direct overhead perspective, the dish's length and width are represented consistently in the images, thereby maintaining a consistent PAR across different dish sizes.
\(X=U\Sigma {V}^{T}\)
where \(X\) denotes the (centred) matrix of pixel coordinates of the plate's segmentation mask, the columns of U and the rows of \({V}^{T}\) are the left and right singular vectors respectively, and \(\Sigma\) is a diagonal matrix containing the singular values \({\sigma }_{1}\ge {\sigma }_{2}\). The PAR is then calculated as \(\frac{{\sigma }_{1}}{{\sigma }_{2}}\).

Area Weight Ratio (AWR): Different food items have their own characteristic geometry and density (g/cm3), and these properties affect food weight estimation. However, from a single image without any depth information, it is often difficult to infer the full geometry of the food items. The AWR feature is therefore designed for rough portion size estimation from a single image. Each food item has its corresponding AWR value (e.g. beef stew: 2.3; yam: 3.5; avocado: 2.0), which represents the weight per unit surface area (g/cm2). This value can be determined through regression for each particular food item before the experiments (i.e. by fitting the mean surface area (cm2) against the weight (g) of each food item).
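The following is a minimal sketch of how these portion size-related features could be assembled from the segmentation outputs. The food type list follows the paper, while the boolean masks, the AWR table (only the three example values quoted above are from the paper), the default AWR and the helper names are illustrative assumptions rather than the exact EgoDiet implementation.

```python
# Sketch of assembling the portion-size feature vector from segmentation outputs.
# Masks are assumed to be boolean HxW arrays; AWR_TABLE entries are illustrative.
import numpy as np

FOOD_TYPES = ["onions & tomato salad", "tilapia fish", "ugali", "yam",
              "chicken drumstick", "spinach stew", "avocado", "banku",
              "tomato soup", "roasted beef", "chapati", "onions",
              "rice dish", "salted fish", "beef stew"]
AWR_TABLE = {"beef stew": 2.3, "yam": 3.5, "avocado": 2.0}   # g/cm^2, fitted beforehand

def plate_aspect_ratio(plate_mask):
    """PAR: ratio of the two singular values of the centred plate-mask coordinates."""
    ys, xs = np.nonzero(plate_mask)
    coords = np.stack([xs, ys], axis=1).astype(float)
    coords -= coords.mean(axis=0)                       # centre the point cloud
    _, sigma, _ = np.linalg.svd(coords, full_matrices=False)
    return sigma[0] / sigma[1]                          # sigma_1 / sigma_2

def feature_vector(food_name, food_mask, plate_mask, container_volume_cm3):
    ft = np.zeros(len(FOOD_TYPES))                      # FT: one-hot food type
    ft[FOOD_TYPES.index(food_name)] = 1.0
    n_f, n_p = food_mask.sum(), plate_mask.sum()        # pixel counts of food and plate
    frr = n_f / n_p                                     # FRR
    vc = 200 * round(container_volume_cm3 / 200)        # VC, rounded to nearest 200 cm^3
    par = plate_aspect_ratio(plate_mask)                # PAR
    awr = AWR_TABLE.get(food_name, 1.0)                 # AWR for this food type (assumed default)
    return np.concatenate([ft, [frr, vc, n_p, par, awr]])  # NP is the plate pixel count
```

For a container imaged directly from above, the plate mask is close to circular, the two singular values are similar and the PAR is close to 1; an oblique viewing angle elongates the mask and increases the ratio.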

In this study, we utilise a domain-specific data augmentation technique known as the volume-weight scaling method, which leverages the physical relationship between the inner volume of a food container and the weight of the food it contains. After applying EgoDiet:3DNet and EgoDiet:Feature to a specific image frame, a set of portion size-related features is obtained, which forms a training pair when coupled with the measured food weight. By tuning the VC, we can generate additional training pairs, as the food weight varies linearly with the inner volume of the container. This technique is particularly useful because it has the potential to estimate portion sizes for food containers of previously unseen sizes, expanding the applicability of our method. Exploring enhanced data augmentation methods for portion size estimation is an interesting topic for future research.
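A minimal sketch of this volume-weight scaling idea is given below, assuming the feature layout from the previous sketch (with VC stored at index 16); the scale factors are illustrative and not those used in the study.

```python
# Sketch of volume-weight scaling augmentation, assuming the feature layout from the
# previous sketch (VC stored at index 16). Scale factors are illustrative only.
import numpy as np

VC_INDEX = 16   # position of the container-volume feature (assumed layout)

def volume_weight_scaling(features, weight_g, scales=(0.8, 0.9, 1.1, 1.2)):
    """Create extra (features, weight) training pairs: the container volume and the
    food weight are scaled by the same factor, reflecting their linear relationship."""
    pairs = [(np.array(features, dtype=float), weight_g)]
    for s in scales:
        f = np.array(features, dtype=float)
        f[VC_INDEX] = 200 * round(f[VC_INDEX] * s / 200)   # keep VC on the 200 cm^3 grid
        pairs.append((f, weight_g * s))
    return pairs
```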

In our study, we implemented a variety of machine learning models in Python, relying on the scikit-learn library. Each model was configured with explicitly selected hyperparameters to optimise its performance. We employed random forest, extra trees, and gradient boosting regressors, configured with 10, 10, and 20 estimators, respectively. Our ensemble approach involved a stacking regressor, in which a random forest regressor and an extra trees regressor, each with 10 estimators, served as base estimators. The meta-regressor in this ensemble was a Multi-Layer Perceptron (MLP) regressor with hidden layer sizes of (250, 250, 250, 250, 125, 50), 'relu' activation, the 'adam' solver, an 'adaptive' learning rate, and a maximum of 10,000 epochs to ensure thorough convergence.
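The configuration described above can be reproduced in scikit-learn roughly as follows; X_train and y_train denote the portion-size feature vectors and the corresponding food weights (g) and are assumed here.

```python
# Sketch of the regressors described above, using scikit-learn.
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.neural_network import MLPRegressor

rf = RandomForestRegressor(n_estimators=10)        # standalone model
et = ExtraTreesRegressor(n_estimators=10)          # standalone model
gb = GradientBoostingRegressor(n_estimators=20)    # standalone model

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=10)),
                ("et", ExtraTreesRegressor(n_estimators=10))],
    final_estimator=MLPRegressor(hidden_layer_sizes=(250, 250, 250, 250, 125, 50),
                                 activation="relu", solver="adam",
                                 learning_rate="adaptive", max_iter=10000),
)
# stack.fit(X_train, y_train)
# weight_pred = stack.predict(X_test)
```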

We assessed the performance of the proposed models using several evaluation metrics, including permutation importance, the accuracy of portion size estimation, and MAPE. The permutation importance of feature k is computed as:
\({i}_{k}=s-\frac{1}{M}\sum _{m=1}^{M}{s}_{m,k}\)
where \({i}_{k}\) refers to the permutation importance of feature k and M refers to the number of repetitions (M = 20). s represents the reference score of the model evaluated on the original dataset, while \({s}_{m,k}\) refers to the reference score of the model evaluated on the dataset with feature k shuffled in repetition m. The accuracy of portion size estimation is defined as:
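As a sketch, the same quantity can be obtained with scikit-learn's permutation_importance, assuming the fitted stacked model and a held-out validation split (X_val, y_val) from the previous sketch:

```python
# Permutation importance with M = 20 repetitions via scikit-learn; `stack` is assumed
# to be fitted already, and (X_val, y_val) is an assumed held-out validation split.
from sklearn.inspection import permutation_importance

result = permutation_importance(stack, X_val, y_val, n_repeats=20, random_state=0)
# result.importances_mean[k] corresponds to i_k: the reference score s minus the mean
# score over the 20 evaluations in which feature k is shuffled.
```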
\({\rm{Accuracy}}=\frac{1}{M}\sum _{i=1}^{M}f\left(| {\hat{y}}_{i}-{y}_{i}| \le \epsilon \right)\)
where f(true) = 1 and f(false) = 0 is a Boolean function, and ϵ is the tolerance for determining a correct prediction (ϵ = 60). M refers to the total number of test samples, and \({y}_{i}\) and \({\hat{y}}_{i}\) represent the ground-truth and the estimated weight of the ith sample, respectively (visual estimation of portion size is generally limited in precision, and small errors in weight (g) may be difficult to detect). The MAPE is defined as:
\({\rm{MAPE}}=\frac{1}{M}\sum _{i=1}^{M}\left|\frac{{y}_{i}-{\hat{y}}_{i}}{{y}_{i}}\right|\times 100 \%\)
where M is the total number of samples, \({y}_{i}\) is the ground-truth value of the ith sample and \(\hat{{y}_{i}}\) is the predicted value of the ith sample. The absolute difference between the ground-truth and predicted values is divided by the ground-truth value and multiplied by 100% to obtain the percentage error for each sample; the average of these percentage errors across all samples gives the mean absolute percentage error. In our study, MAPE is used to quantify the error in consumed portion sizes: \({y}_{i}\) refers to the WFR value of the ith food item and \(\hat{{y}_{i}}\) is the corresponding consumed portion size estimated by EgoDiet or by the dietitians.
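A minimal NumPy sketch of these two metrics, assuming y_true and y_pred hold the ground-truth and estimated consumed portion sizes in grams:

```python
# NumPy sketch of the tolerance-based accuracy and MAPE defined above.
import numpy as np

def tolerance_accuracy(y_true, y_pred, eps=60.0):
    """Fraction of samples whose absolute weight error is within the tolerance eps."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_pred - y_true) <= eps)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (%) between ground truth and estimates."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
```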

The dataset cannot be made publicly available due to privacy protection, general data protection regulations, and institutional guidelines of the Noguchi Memorial Institute for Medical Research Institutional Review Board (NMIMR-IRB).

The AI model was implemented using Python 3, Keras, and TensorFlow. The code for segmentation is available at https://github.com/matterport/Mask_RCNN. The machine learning algorithms can be implemented using the scikit-learn library.

Jobarteh, M. L., McCrory, M. A., Lo, B. et al. Development and validation of an objective, passive dietary assessment method for estimating food and nutrient intake in households in low-and middle-income countries: a study protocol. Curr. Dev. Nutr. 4, nzaa020 (2020).

He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2961–2969 (IEEE, 2017).

Qiu, J. et al. Egocentric image captioning for privacy-preserved passive dietary intake monitoring. IEEE Transactions on Cybernetics (2023).

Konstantakopoulos, F. S., Georga, E. I. & Fotiadis, D. I. A novel approach to estimate the weight of food items based on features extracted from an image using boosting algorithms. Sci. Rep. 13, 21040 (2023).

Lo, F. P.-W., Sun, Y., Qiu, J. & Lo, B. P. Point2volume: a vision-based dietary assessment approach using view synthesis. IEEE TII 16, 577–586 (2019).

Dehais, J., Anthimopoulos, M., Shevchik, S. & Mougiakakou, S. Two-view 3d reconstruction for food volume estimation. IEEE TMM 19, 1090–1099 (2017).

Rahman, M. H. et al. Food volume estimation in a mobile phone based dietary assessment system. In IEEE SITIS, 988–995 (IEEE, 2012).

Fang, S., Liu, C., Zhu, F., Delp, E. J. & Boushey, C. J. Single-view food portion estimation based on geometric models. 385–390 (IEEE ISM, 2015).

Jia, W., Chen, H.-C. & Yue, Y. et al. Accuracy of food portion size estimation from digital pictures acquired by a chest-worn camera. Public Health Nutr. 17, 1671–1681 (2014).

Xu, C., He, Y., Khanna, N., Boushey, C. J. & Delp, E. J. Model-based food volume estimation using 3d pose. 2534–2538 (IEEE ICIP, 2013).

Yang, Y., Jia, W., Bucher, T., Zhang, H. & Sun, M. Image-based food portion size estimation using a smartphone without a fiducial marker. Public Health Nutr. 22, 1180–1192 (2019).

Jia, W. et al. Image based estimation of food volume using circular referents in dietary assessment. J. Food Eng. 109, 76–86 (2012).

Lo, F. P. W., Sun, Y., Qiu, J. & Lo, B. Image-based food classification and volume estimation for dietary assessment: a review. (IEEE JBHI, 2020).

Zhu, F., Bosch, M., Khanna, N., Boushey, C. J. & Delp, E. J. Multiple hypotheses image segmentation and classification with application to dietary assessment. IEEE JBHI 19, 377–388 (2014).

Zhu, F., Bosch, M., Boushey, C. J. & Delp, E. J. An image analysis system for dietary assessment and evaluation. In 2010 ICIP, 1853–1856 (IEEE, 2010).

Zhu, F. et al. The use of mobile devices in aiding dietary assessment and evaluation. IEEE J.-STSP 4, 756–766 (2010).

Meyers, A. et al. Im2calories: towards an automated mobile vision food diary. 1233–1241 (ICCV, 2015).

Sun, M., Burke, L. E. & Baranowski, T. et al. An exploratory study on a chest-worn computer for evaluation of diet, physical activity and lifestyle. J. Health. Eng. 6, 1–22 (2015).

Lo, F. P.-W., Sun, Y. & Lo, B. Depth estimation based on a single close-up image with volumetric annotations in the wild: A pilot study. In IEEE AIM, 513–518 (IEEE, 2019).

Doulah, A., Ghosh, T., Hossain, D., Imtiaz, M. H. & Sazonov, E. "Automatic ingestion monitor version 2": a novel wearable device for automatic food intake detection and passive capture of food images. (IEEE JBHI, 2020).

Sun, M. et al. ebutton: a wearable computer for health monitoring and personal assistance. In Proceedings of the 51st Annual Design Automation Conference, 1–6 (2014).

Jobarteh, M. L. et al. Evaluation of acceptability, functionality, and validity of a passive image-based dietary intake assessment method in adults and children of Ghanaian and Kenyan origin living in London, UK. Nutrients 15, 4075 (2023).

Ferdinand Christ, P. et al. Diabetes60 - inferring bread units from food images using fully convolutional neural networks. (ICCV, 2017).

Graikos, A., Charisis, V., Iakovakis, D., Hadjidimitriou, S. & Hadjileontiadis, L. Single image-based food volume estimation using monocular depth-prediction networks. In Universal Access in Human-Computer Interaction. Applications and Practice: 14th International Conference, UAHCI 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, July 19–24, 2020, Proceedings, Part II 22, 532–543 (Springer, 2020).

Lu, Y., Stathopoulou, T. & Mougiakakou, S. Partially supervised multi-task network for single-view dietary assessment. In 2020 25th International Conference on Pattern Recognition (ICPR), 8156–8163 (IEEE, 2021).

Pettitt, C. et al. A pilot study to determine whether using a lightweight, wearable micro-camera improves dietary assessment accuracy and offers information on macronutrients and eating rate. Br. J. Nutr. 115, 160–167 (2016).

Johansson, G., Wikman, Å, Åhrén, A.-M., Hallmans, G. & Johansson, I. Underreporting of energy intake in repeated 24-hour recalls related to gender, age, weight status, day of interview, educational level, reported food intake, smoking habits and area of living. Public Health Nutr. 4, 919–927 (2001).

Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning (2023).

Kirillov, A. et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4015–4026 (2023).

Yang, L. et al. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10371–10381 (2024).

Vasiloglou, M. F. et al. A comparative study on carbohydrate estimation: Gocarb vs. dietitians. Nutrients 10, 741 (2018).

Chotwanvirat, P., Hnoohom, N., Rojroongwasinkul, N. & Kriengsinyos, W. Feasibility study of an automated carbohydrate estimation system using Thai food images in comparison with estimation by dietitians. Front. Nutr. 8, 732449 (2021).

Qiu, J., Lo, F. P.-W. & Lo, B. Assessing individual dietary intake in food sharing scenarios with a 360 camera and deep learning. In IEEE BSN, 1–4 (IEEE, 2019).

Nie, J. et al. Automatic detection of dining plates for image-based dietary evaluation. 4312–4315 (IEEE EMBC, 2010).

Jia, W. et al. Automatic food detection in egocentric images using artificial intelligence technology. Public health Nutr. 22, 1168–1179 (2019).

Lo, F. P. W., Guo, Y., Sun, Y., Qiu, J. & Lo, B. An intelligent vision-based nutritional assessment method for handheld food items. (IEEE Transactions on Multimedia, 2022).

Lin, T.-Y. et al. Microsoft coco: Common objects in context. In ECCV, 740–755 (Springer, 2014).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).

Lin, T.-Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2117–2125 (2017).

Abdulla, W. Mask r-cnn for object detection and instance segmentation on keras and tensorflow. https://github.com/matterport/Mask_RCNN (2017).

Lei, J., Qiu, J., Lo, F. P.-W. & Lo, B. Assessing individual dietary intake in food sharing scenarios with food and human pose detection. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part V, 549–557 (Springer, 2021).

Gao, A., Lo, F. P. W. & Lo, B. Food volume estimation for quantifying dietary intake with a wearable camera. 110–113 (IEEE BSN, 2018).

Fang, S. et al. A comparison of food portion size estimation using geometric models and depth images. 26–30 (IEEE ICIP, 2016).

Pouladzadeh, P., Shirmohammadi, S. & Al-Maghrabi, R. Measuring calorie and nutrition from food image. IEEE TIM 63, 1947–1956 (2014).

Chokr, M. & Elbassuoni, S. Calories prediction from food images. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, 4664–4669 (2017).

Fang, S. et al. Single-view food portion estimation: Learning image-to-energy mappings using generative adversarial networks. In IEEE ICIP, 251–255 (IEEE, 2018).

Pfisterer, K. J. et al. Automated food intake tracking requires depth-refined semantic segmentation to rectify visual-volume discordance in long-term care homes. Sci. Rep. 12, 83 (2022).

We acknowledge the financial support of the Innovative Passive Dietary Monitoring Project funded by the Bill & Melinda Gates Foundation (Opportunity ID: OPP1171395). We also acknowledge the invaluable input from researchers across diverse fields, including medicine, engineering, and mathematics, during this study.

The Hamlyn Centre for Robotic Surgery, Imperial College London, London, UK

Frank P.-W. Lo, Jianing Qiu, Yingnan Sun & Zeyu Wang

Department of Population Health, London School of Hygiene & Tropical Medicine, London, UK

College of Electronics and Information Engineering, Tongji University, Shanghai, China

Department of Pediatrics, USDA/ARS Children’s Nutrition Research Center, Baylor College of Medicine, Houston, TX, USA

Department of Nutritional Sciences, University of Georgia, Athens, GA, USA

Department of Health Sciences, Boston University, Boston, MA, USA

Department of Electrical and Computer Engineering, The University of Alabama, Tuscaloosa, AL, USA

Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA

Department of Neurological Surgery, University of Pittsburgh, Pittsburgh, PA, USA

Department of Nutrition and Food Science, University of Ghana, Accra, Ghana

Section of Nutrition, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK

Gary Frost & Benny Lo

F.L., J.Q., M.J., Y.S., Z.W., S.J., T.B., A.A., M.C., E.S., W.J., M.S., M.A., G.F., and B.L. conceived and designed the study. F.L., J.Q., M.J., Y.S., Z.W., S.J., T.B., A.A., M.C., E.S., W.J., M.S., M.A., G.F., and B.L. conducted and optimised the experimental workflow. J.Q. constructed and pre-processed the dataset. Y.S. annotated the dataset. F.L., J.Q., Z.W. and B.L. contributed to analysing and interpreting the data. F.L. and J.Q. drafted the manuscript. M.J., Y.S., Z.W., S.J., T.B., A.A., M.C., E.S., W.J., M.S., M.A., G.F. and B.L. contributed significant amendments to the final manuscript. Each author has reviewed and approved the manuscript, ensuring agreement with the content presented.

Correspondence to Frank P.-W. Lo, Jianing Qiu, Gary Frost or Benny Lo.

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Lo, F. P.-W., Qiu, J., Jobarteh, M. L. et al. AI-enabled wearable cameras for assisting dietary assessment in African populations. npj Digit. Med. 7, 356 (2024).

DOI: https://doi.org/10.1038/s41746-024-01346-8
