- Title
- Food data integration by using heuristics based on lexical and semantic similarities
- Creator
- Popovski, Gorjan; Ispirova, Gordana; Hadzi-Kotarova, Nina; Valencic, Eva; Eftimov, Tome; Seljak, Barbara Korousic
- Relation
- 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020). Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) (Vienna, Austria 11-13 February, 2020) p. 208-216
- Publisher Link
- http://dx.doi.org/10.5220/0008990602080216
- Publisher
- Science and Technology (SCITEPRESS)
- Resource Type
- conference paper
- Date
- 2020
- Description
- With the rapidly growing food supply in the last decade, vast amounts of food-related data have been collected. To make this data interoperable and suitable for analyses of relations between food, as one of the main environmental and health outcomes, data coming from various sources needs to be normalized. Food data can have varying sources and formats (food composition, food consumption, recipe data), yet the most familiar type is food product data, which is often misinterpreted due to the marketing strategies of different producers and retailers. Several recent studies have addressed the problem of heterogeneous data by matching food products using lexical similarity between their English names. In this study, we address this problem for Slovenian, a non-English language that is under-researched in terms of natural language processing. To match food products, we use our previously developed heuristic based on lexical similarity and propose two new semantic similarity heuristics based on word embeddings. The proposed heuristics are evaluated using a dataset with 438 ground truth pairs of food products, obtained by matching their EAN barcodes. Preliminary results show that the lexical similarity heuristic provides more promising results (75% accuracy), while the best semantic similarity model yields an accuracy of 62%.
- Subject
- data normalization; food integration; lexical similarity; semantic similarity; word embeddings
- Identifier
- http://hdl.handle.net/1959.13/1436246
- Identifier
- uon:39960
- Identifier
- ISBN:9789897583988
- Language
- eng
- Reviewed