PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis

1National University of Singapore   2Wuhan University   3The University of Auckland
4Singapore University of Technology and Design   5Nanyang Technological University

Accepted by ACM MM 2024 (Oral)
(* Correspondence)

Abstract

While existing Aspect-based Sentiment Analysis (ABSA) has received extensive research effort and advancement, there are still gaps in defining a more holistic research target that seamlessly integrates multimodality, conversation context, and fine granularity, while also covering changing sentiment dynamics and cognitive causal rationales. This paper bridges the gaps by introducing multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing the holder, target, aspect, opinion, sentiment, and rationale from multi-turn, multi-party, multimodal dialogue; and 2) Sentiment Flipping Analysis, detecting dynamic sentiment transformations throughout the conversation together with their causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multiple scenarios, and coverage of both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. This work is expected to open up a new era for the ABSA community.

Task Definition

Subtask-I: Panoptic Sentiment Sextuple Extraction. Given a dialogue D = {u1, ..., un} with a replying structure {(ui, uj), ...} (i.e., ui replies to uj), the task is to extract all sextuples (h, t, a, o, s, r). Each utterance ui = {w1, ..., wmi} contains mi words of text (denoted It), optionally accompanied by non-text modalities, i.e., image (Ii), audio (Ia), and video (Iv). The elements h (holder), t (target), a (aspect), o (opinion), and r (rationale) can either be continuous text spans explicitly mentioned in the utterances or be implicitly inferred from the context or the non-text modalities. s denotes the sentiment category (positive, negative, or neutral).
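
For illustration, a sextuple can be rendered as a simple data structure. The sketch below is a minimal, hypothetical Python rendering (field names and the example are our own, not prescribed by the dataset format):

from dataclasses import dataclass

@dataclass
class Sextuple:
    """One panoptic sentiment sextuple extracted from a dialogue.
    Text fields hold either an explicit span from an utterance or an
    implicitly inferred description."""
    holder: str      # h: who expresses the opinion
    target: str      # t: the entity being discussed
    aspect: str      # a: the specific aspect of the target
    opinion: str     # o: the opinion expression
    sentiment: str   # s: positive / negative / neutral
    rationale: str   # r: the cognitive cause behind the sentiment

# A made-up example (not taken from PanoSent):
example = Sextuple(
    holder="Speaker A",
    target="the new phone",
    aspect="battery life",
    opinion="drains way too fast",
    sentiment="negative",
    rationale="it only lasts half a day under normal use",
)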

Subtask-II: Sentiment Flipping Analysis. Given the same input D as in subtask-I, the task is to detect all sextuples (h, t, a, ζ, φ, τ). Here, h, t, and a denote the holder, target, and aspect, consistent with the definitions in subtask-I. ζ and φ denote the initial and flipped sentiments, respectively, capturing the dynamic change in sentiment of the same speaker towards the same aspect of the same target. τ is the trigger that induces the sentiment transition, a pre-defined label among four categories: 1) introduction of new information, 2) logical argumentation, 3) participant feedback and interaction, and 4) personal experience and self-reflection. Since subtask-II shares several elements with subtask-I, it is natural to detect flipping on top of the results of subtask-I to minimize redundancy.
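
Analogously, a sentiment flip can be stored over the shared holder-target-aspect triple; the snippet below is a minimal sketch under the same hypothetical naming as above, with the four trigger labels as an enumeration:

from dataclasses import dataclass
from enum import Enum

class Trigger(Enum):
    NEW_INFORMATION = "introduction of new information"
    LOGICAL_ARGUMENTATION = "logical argumentation"
    PARTICIPANT_FEEDBACK = "participant feedback and interaction"
    PERSONAL_EXPERIENCE = "personal experience and self-reflection"

@dataclass
class SentimentFlip:
    holder: str        # h: the same speaker
    target: str        # t: the same target
    aspect: str        # a: the same aspect
    initial: str       # ζ: sentiment before the flip
    flipped: str       # φ: sentiment after the flip
    trigger: Trigger   # τ: what induced the transition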



PanoSent: A Multimodal Conversational ABSA Dataset

We contribute PanoSent, a large-scale, high-quality benchmark dataset featuring conversational contexts, multimodality, multilingualism, and multiple domains.



Sentica MLLM: Multimodal LLM Backbone

We develop a novel MLLM, namely Sentica. Specifically, we leverage ImageBind as the unified encoder for all three non-text modalities. A linear layer then connects ImageBind to the LLM for representation projection.
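
The description above suggests a simple connector design. The following is a minimal PyTorch sketch of such a projection, assuming ImageBind's 1024-dimensional joint embedding and a generic LLM hidden size; the dimensions and module names are illustrative, not the released Sentica code:

import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Projects non-text features from a frozen ImageBind encoder into
    the LLM's token embedding space (a sketch, not the official code)."""

    def __init__(self, imagebind_dim: int = 1024, llm_hidden_dim: int = 4096):
        super().__init__()
        # A single linear layer, as described above.
        self.proj = nn.Linear(imagebind_dim, llm_hidden_dim)

    def forward(self, nontext_features: torch.Tensor) -> torch.Tensor:
        # nontext_features: (batch, imagebind_dim) from image / audio / video.
        # Returns (batch, 1, llm_hidden_dim), usable as a soft "modality token"
        # prepended to the text embeddings fed into the LLM.
        return self.proj(nontext_features).unsqueeze(1)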



CoS Reasoning Framework: Chain-of-Sentiment


▶ Step-1: Target-Aspect Identification

Given the input dialogue, possibly accompanied by multimodal signals, together with a specific instruction, the initial step prompts Sentica to identify all possible targets and their specific aspects discussed within the dialogue.
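
As a concrete illustration, the instruction for this step might look like the following hypothetical template (the exact wording used in the paper may differ):

STEP1_PROMPT = """\
Given the following dialogue (with any attached images, audio, or video),
list every target entity that is discussed and the specific aspects of each
target. Answer as a list of (target, aspect) pairs.

Dialogue:
{dialogue}
"""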


▶ Step-2: Holder-Opinion Detection

The second step detects the holders and their specific opinions regarding the identified targets and aspects. We require Sentica to output a set of quadruples consisting of the holder, target, aspect, and opinion. After this step, we obtain holder-target-aspect-opinion quadruples, which lay the foundation for the subsequent sentiment understanding.
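
Continuing the same hypothetical prompt style, step 2 could condition on the step-1 output and ask for quadruples:

STEP2_PROMPT = """\
For each (target, aspect) pair identified below, find who expresses an
opinion about it in the dialogue and what that opinion is. Answer as a list
of (holder, target, aspect, opinion) quadruples.

Dialogue:
{dialogue}

(target, aspect) pairs:
{target_aspect_pairs}
"""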


▶ Step-3: Sentiment-Rationale Mining

The third step analyzes the sentiment associated with each opinion and identifies the rationale, based on the identified holder-target-aspect-opinion quadruples. We ask Sentica to output a set of sextuples by further adding the sentiment and rationale to the previous quadruples.
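
In the same spirit, step 3 can reuse the quadruples and ask the model to complete them into sextuples; again a hypothetical template:

STEP3_PROMPT = """\
For each (holder, target, aspect, opinion) quadruple below, decide the
holder's sentiment (positive, negative, or neutral) and give the rationale,
i.e. the cause or evidence behind that sentiment, quoting the dialogue where
possible. Answer as (holder, target, aspect, opinion, sentiment, rationale)
sextuples.

Dialogue:
{dialogue}

Quadruples:
{quadruples}
"""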


▶ Step-4: Sentiment Flipping Trigger Classification

With all the sextuples detected, the final step of discerning sentiment flipping becomes much easier. Specifically, we prompt Sentica to first summarize any change in sentiment (i.e., from an initial sentiment to a flipped sentiment) of the same holder-target-aspect triple, and then to classify the trigger label for each sentiment flip.
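
Given the sextuples from step 3, candidate flips can even be surfaced with plain grouping logic before prompting Sentica for the trigger label. A minimal, self-contained sketch (assuming sextuples are plain tuples in dialogue order):

from collections import defaultdict

def candidate_flips(sextuples):
    """Each sextuple is a (holder, target, aspect, opinion, sentiment, rationale)
    tuple in dialogue order. Group by (holder, target, aspect) and yield the
    pairs whose sentiment changes; the trigger is then classified by a
    follow-up prompt to the model."""
    groups = defaultdict(list)
    for h, t, a, o, s, r in sextuples:
        groups[(h, t, a)].append(s)
    for (h, t, a), sentiments in groups.items():
        for before, after in zip(sentiments, sentiments[1:]):
            if before != after:
                yield h, t, a, before, after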


▶ Step-5: Paraphrase-based Verification

Since we design the entire two-task solution as a step-wise process, a potential issue is that CoS could accumulate errors across steps. To mitigate this, an intuitive approach is to first convert the structured k-tuples into natural-language expressions through paraphrasing, effectively creating a claim that conveys the same meaning in a different format, and then let the LLM check whether this claim stands in an entailment or contradiction relationship with the given dialogue context and information. We refer to this as the Paraphrase-based Verification (PpV) mechanism.
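
A minimal sketch of the PpV idea, assuming a generic llm callable that returns text (the paraphrasing template and the entailment wording are illustrative, not the paper's exact prompts):

def verify_sextuple(llm, dialogue, holder, target, aspect, opinion, sentiment, rationale):
    """Paraphrase a structured sextuple into a natural-language claim, then ask
    the LLM whether the dialogue entails or contradicts it (sketch only)."""
    claim = (
        f"{holder} holds a {sentiment} sentiment towards the {aspect} of "
        f"{target}, expressed by the opinion '{opinion}', because {rationale}."
    )
    answer = llm(
        f"Dialogue:\n{dialogue}\n\nClaim: {claim}\n\n"
        "Does the dialogue entail or contradict this claim? "
        "Answer with 'entailment' or 'contradiction'."
    )
    # Keep the sextuple only if the claim is judged to be entailed.
    return "entail" in answer.lower()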


Experiment


▶ Main results of Subtask-I, Panoptic Sentiment Sextuple Extraction.

▶ Main results of Subtask-II, Sentiment Flipping Analysis.

BibTeX

@article{luo2024panosent,
  title={PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis},
  author={Luo, Meng and Fei, Hao and Li, Bobo and Wu, Shengqiong and Liu, Qian and Poria, Soujanya and Cambria, Erik and Lee, Mong-Li and Hsu, Wynne},
  journal={arXiv preprint arXiv:2408.09481},
  year={2024}
}