Fake news and dis-, mis-, and mal-information do not appear randomly; they serve the broader purpose of supporting specific narrative structures which define distinct identities. There are financial, political, and societal incentives for strengthening these identities. Individual fake news items are quasi-random reflections of these narrative structures. As long as an information entity (a news story, a newly coined term, an out-of-context statistic) supports a narrative structure, it can be incorporated into the identity.
Our current approach to fake news revolves around classification (recognizing fake news) and dissemination (observing how fake news spreads). We use standard representations (text, image) enriched with social network analysis (who shares what with whom), but these representations reflect only the content of a fake news item and its lifetime. They do not capture the process that led to the creation of a particular item. In other words, we have access only to the form, not to the substance.
What we need is a radical change in the way we represent news items: a representation capable of revealing the elements of the underlying narrative structures that provide the context for each news item. When a human expert examines a particular news item, she can easily spot the incentives, narratives, and purposes behind it. We must develop data representations that allow such expert knowledge to be encoded easily.
Weakly supervised learning is a paradigm for training statistical models that is quickly gaining popularity among machine learning practitioners. In particular, using sparse labeling functions to provide fuzzy labels for items is a promising research direction. We propose to apply weakly supervised learning via labeling functions to encode domain knowledge and gain a much deeper understanding of the fake news creation process. In our approach, weakly supervised learning is not a way to train statistical models but a way to train data representations. To this end, we need to solve several challenging problems:
- develop a methodology of knowledge engineering using labeling functions,
- design the repository of labeling functions,
- adapt machine learning algorithms to the new data representation based on labeling signals,
- encode several domains using hundreds of labeling functions to evaluate the feasibility of the approach experimentally.
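To make the notion of a labeling-function repository concrete, here is a minimal Python sketch. Everything in it is hypothetical: the function names, the keyword heuristics, and the narrative labels are illustrative stand-ins for the far richer expert-written functions the proposal envisions. Each function votes for a narrative label or abstains, and applying the whole repository to an item yields its vector of labeling signals.

```python
from typing import Callable, List

ABSTAIN = -1          # the labeling function has no opinion on this item
CONSPIRACY = 0        # hypothetical narrative label
HEALTH_SCARE = 1      # hypothetical narrative label

def lf_cover_up_vocabulary(item: str) -> int:
    """Vote CONSPIRACY if the item uses typical cover-up vocabulary."""
    keywords = ("cover-up", "they don't want you to know", "hidden truth")
    return CONSPIRACY if any(k in item.lower() for k in keywords) else ABSTAIN

def lf_miracle_cure(item: str) -> int:
    """Vote HEALTH_SCARE if the item promises an implausible cure."""
    keywords = ("miracle cure", "doctors hate", "one simple trick")
    return HEALTH_SCARE if any(k in item.lower() for k in keywords) else ABSTAIN

# The repository is simply an ordered collection of labeling functions.
REPOSITORY: List[Callable[[str], int]] = [lf_cover_up_vocabulary, lf_miracle_cure]

def labeling_signals(item: str) -> List[int]:
    """Represent a news item as the vector of its labeling functions' votes."""
    return [lf(item) for lf in REPOSITORY]

print(labeling_signals("The hidden truth about this miracle cure!"))  # [0, 1]
print(labeling_signals("Local weather report for Tuesday"))           # [-1, -1]
```

The key design point is that the signal vector, not the raw text, becomes the item's representation, so every new labeling function added to the repository enriches the representation of all items retroactively.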
We believe that labeling functions can be a very effective weapon against fake news. Firstly, even after defining only a few hundred such functions, we can achieve high accuracy in recognizing past fake news. More importantly, we can identify narrative structures by observing correlations between labeling signals, and thus anticipate the emergence of new fake narratives. The repository of labeling functions can be incrementally updated and extended to cover new application domains (e.g., public health, economy, ecology, pandemics) as domain knowledge is encoded in labeling functions.
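The idea of spotting narrative structures through correlated labeling signals can be sketched with a toy co-firing statistic. The signal matrix below is entirely fabricated for illustration (rows are news items, columns are labeling functions); in practice it would come from applying the repository to a real corpus.

```python
from itertools import combinations

# Hypothetical labeling signals: rows are news items, columns are
# labeling functions; 1 = the function fired, 0 = it abstained.
signals = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]

def cofiring(signals, i, j):
    """Fraction of items on which labeling functions i and j fire together."""
    together = sum(1 for row in signals if row[i] == 1 and row[j] == 1)
    return together / len(signals)

# Pairs of labeling functions that frequently fire together are
# candidate fragments of a shared narrative structure.
for i, j in combinations(range(3), 2):
    print(f"LF{i} & LF{j}: {cofiring(signals, i, j):.2f}")
# LF0 & LF1: 0.75  -> likely part of one narrative
# LF0 & LF2: 0.25
# LF1 & LF2: 0.25
```

In a real deployment one would replace this raw co-firing fraction with a proper association measure and cluster the labeling functions, so that each cluster approximates one underlying narrative.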
The research envisioned in this proposal focuses primarily on knowledge representation, knowledge engineering, natural language processing, and machine learning, but input from the social sciences is very welcome. The main aim of the proposal is to develop a fundamentally different representation of data which will (hopefully) give us unprecedented tools to fight the wave of disinformation.