A term function–aware keyword citation network method for science mapping analysis

https://doi.org/10.1016/j.ipm.2023.103405

Abstract

Various keyword network methods are used to map scientific fields, but few studies have considered the semantic roles of keywords in such networks. This study proposes a term function–aware keyword citation network to address this limitation. Specifically, we first used a term function identification method to identify research questions and methods from scientific articles. Then, we constructed a question-method term citation network to represent the correlation structure of keywords. Next, we explored the topology characteristics, question-method bipartite network, and knowledge community structure of the generated network to validate its superiority in science mapping analysis. A dataset of 299,567 conference proceedings collected from the Association for Computing Machinery (ACM) digital library was used to evaluate the effectiveness of our methods. The results show that the term function identification model based on Bidirectional Encoder Representations from Transformers (BERT) achieves an F1 score of 0.90. Moreover, the question-method term citation network outperforms existing keyword citation methods in revealing association patterns between scientific knowledge and improving the interpretability of the knowledge structure of the computing field. We believe that our work expands the methodology of keyword citation network and science mapping analysis and provides guidance for considering the term function in various scenarios.

Keywords

Term function
Keyword citation network
Question-method term citation network
Science mapping analysis
Term function–aware keyword citation network

1. Introduction

Various methods and tools can help scholars map scientific fields. Previous studies have used various relationships regarding scientific articles, including direct citation, bibliographic coupling, co-citation, co-authorship, and co-word, to map the intellectual structure (Katchanov & Markova, 2022; Zhang, Xie, Song & Song, 2022), depict emerging research trends (Behrouzi, Sarmoor, Hajsadeghi & Kavousi, 2020; Katsurai & Ono, 2019), analyze topic evolution (Hu et al., 2019; Lu et al., 2021; Lu, Wang & Hu, 2020), and reveal the network community (Cheng, Wang, Lu, Huang & Bu, 2020; Lozano, Calzada-Infante, Adenso-Díaz & García, 2019). However, with the dramatic growth in the number of scientific publications, understanding the intellectual structure and relationships between concepts in an area from scientific articles remains a very challenging task (Tosi & dos Reis, 2021).

With the development of citation analysis, the citation relationship between articles has gradually expanded to entity citation or keyword citation. Various fine-grained citation networks have been proposed, such as the bio-entity citation network (Ding et al., 2013), word bibliographic coupling network (Hsiao & Chen, 2020), keyword-citation-keyword network (Cheng et al., 2020), and keyword pair-based citation relationship (Zhang et al., 2022). These keyword network methods have made significant progress and have been widely used to map scientific structures. However, in such networks, each keyword is simplified to a single symbolic expression, and its meaning is considered fixed within the field. In fact, keywords have specific semantic functions in different contexts. For example, the keyword ‘deep learning’ is the research question of the paper by Baraniuk, Donoho and Gavish (2020), whereas it represents a research method in the paper by Blanco and Lourenço (2022). Therefore, distinguishing the semantic roles of keywords in scientific articles is vital to represent the intellectual structure, which provides fine-grained and more accurate measurement for relevant analysis and applications (Cambrosio, Cointet & Abdo, 2020).

Some scholars have also realized the limited semantic information in science mapping analysis and introduced topic modeling, Word2Vec, multiple relationships between keywords, and semantic information of citations to address this issue (Hu et al., 2019; Lee, Jung & Song, 2016; Zhang & Yuan, 2022; Zhang et al., 2022; Zhu & Zhang, 2020). These methods extend science mapping analysis with more semantic information from various perspectives, but they ignore the semantic roles of keywords themselves. However, ignoring the semantic roles of keywords may distort the representation and modeling of the intellectual structure.

Therefore, this paper aims to integrate the term function into the keyword citation network, constructing a term function–aware keyword citation network and improving the performance of science mapping analysis. Generally, the semantic function of terms in scientific articles could be ‘research question’, ‘research method’, ‘technique’, ‘dataset’, or ‘domain entities’ (Färber, Albers & Schüber, 2021; Lu, Li, Liu & Cheng, 2019; Ma & Lund, 2021; Wang & Zhang, 2020). Scientific research is often described as a problem-solving activity, and descriptions of problems and solutions are essential for scientific discourse (Heffernan & Teufel, 2018). Accordingly, in this study, we focus on terms with the ‘research question’ and ‘research method’ functions. Meanwhile, existing studies have used various methods, such as manual annotation, machine learning, and text generation techniques, to identify the term function from literature (Färber et al., 2021; Li, Lu & Cheng, 2022; Lu et al., 2019; Luan et al., 2019; Ma & Lund, 2021; Mesbah, Lofi, Torre, Bozzon & Houben, 2018). This study compared the performance of several of these methods and selected the best one to identify research questions and methods within the scientific literature. Overall, this paper attempts to address the following research questions:

  • RQ1. How to identify research questions and research methods from scientific literature?

  • RQ2. How can a term function–aware keyword citation network be constructed?

  • RQ3. What are the characteristics and performance of the proposed network in science mapping analysis?

To address these questions, we take the following three steps. Firstly, the term function identification model is selected and used to extract research questions and methods from the dataset collected from the Association for Computing Machinery (ACM) digital library. Secondly, we construct a question-method term citation network based on the identified functional terms and citation links between them. Finally, we analyze the topology characteristics, question-method bipartite network, and knowledge community structure of the generated network to validate its superiority in science mapping analysis and provide a fine-grained and more accurate representation of the intellectual structure in the field of computing.

This study extends previous work on keyword citation networks and science mapping analysis. Integrating the term function in the keyword citation network enhances its semantic information, improves the validity of the representation and modeling of the knowledge network, and deepens the understanding of the discipline knowledge structure. It also provides important theoretical and methodological insights for research on bibliometrics and scientific knowledge network and contributes to the mapping and interpretation of computing domains for researchers. The main contributions of this study are as follows:

  • This paper introduces the term function into the keyword citation network and proposes a new method, called the term function–aware keyword citation network, for science mapping analysis.

  • We perform various experiments and identify the best method to extract research questions and methods from the scientific literature.

  • We validate the superiority of the term function–aware keyword citation network in science mapping analysis from multiple aspects and provide an in-depth understanding of the knowledge association and knowledge structure of the computing field.

2. Related work

2.1. Science mapping analysis

Science mapping is an important research topic in information science, and the application of science mapping analysis has been gradually extending to all disciplines (Aria & Cuccurullo, 2017). Through science mapping, one can reveal the structure and evolution of a scientific discipline, analyze the cooperation network of a group of researchers and detect scientific communities or emerging research topics (Chen, 2017; Huang, Glänzel & Zhang, 2021). The general steps of science mapping analysis include data collection, preprocessing, network construction, mapping and visualization, analysis and interpretation (Alcaide–Muñoz, Rodríguez–Bolívar, Cobo & Herrera-Viedma, 2017; Lu et al., 2020). Previous studies have used various relationships to map the scientific structure, among which the citation-based and co-word relationships play critical roles when researchers want to elucidate the intellectual structure of certain research fields.

For citation-based methods, three kinds of citation relationships (i.e., direct citation, bibliographic coupling, and co-citation) are commonly used to create a network of related papers that can represent the association and structure of these papers. Bu et al. (2020) employed an all-author co-citation analysis to map knowledge domains in the field of library and information science. Kleminski, Kazienko and Kajdanowicz (2022) constructed citation networks based on direct citation, co-citation and bibliographic coupling relationships between the papers to discern valuable research topics. Mihalic, Mohamadi, Abbasi and Dávid (2021) used a citation network to map the paradigm of sustainable and responsible tourism field. Co-word analysis uses keywords shared by publications instead of shared citations and uses their co-occurrence relationships to represent the structure of a scientific field and map research topics. Zhao, Mao and Lu (2018) ranked the themes based on various node metrics over co-word networks of three different disciplines and confirmed that keyword frequency and network-based methods could effectively identify hot topics. Lu et al. (2020) constructed a co-word network from parliamentary debates to analyze the topic distribution and evolution of foreign relations. Huang et al. (2021) identified emerging topics based on dynamic co-word network analysis.

In traditional co-word and citation-based networks, the analysis units are commonly regarded as definite symbolic expressions. Current studies are concerned only with the content and theme that a keyword reveals, irrespective of its semantic role. For example, in co-word analysis, articles are modelled as sets of keywords with no hypothesis about their distinctive role in a publication; however, it is crucial to distinguish between these different keywords according to the ontological categories (Cambrosio et al., 2020). To address the limited semantic information in traditional science mapping analysis, some scholars have made efforts to improve its performance from various perspectives. Lee et al. (2016) proposed a subject-method topic network analysis that integrates topic modeling analysis and network analysis and achieved good performance in understanding the knowledge structure of communication studies. Hu et al. (2019) applied Word2Vec to enhance the keywords with more complete semantic information and analysed the topic evolution of scientific literature using spatial autocorrelation measures. Zhu and Zhang (2020) proposed a co-word analysis method based on a subject knowledge network meta-path, which combines multiple semantic relations between words and is superior to traditional co-word analysis. Zhang and Yuan (2022) utilized the semantic and syntactic information of citations to enhance author bibliographic coupling analysis. Zhang et al. (2022) quantitatively measured the evolutionary process of knowledge within a discipline using multiple relationships between keywords.

To summarize, keyword network methods enable researchers to depict the knowledge association and elucidate the field structure, and the general analysis methods proposed by the scholars mentioned above provide a useful reference for this paper. However, the existing studies on science mapping analysis have seldom considered the semantic function of nodes in the co-word network or citation network. Therefore, this paper attempts to enhance the semantic information of the keyword citation network by integrating the term function of keywords.

2.2. Fine-grained citation network

Although numerous studies have uncovered the discipline knowledge structure using citation-based methods, most existing studies only focus on macro-level information, such as articles, journals, and authors. Some scholars have explored the fine-grained citation network to reveal the association in content between citing papers and cited papers. Ding et al. (2013) proposed that entities are either evaluative entities (e.g., papers, authors, and journals) or knowledge entities (e.g., keywords, topics, key methods, and domain entities). Furthermore, they extended the citation network from paper citation to entity citation and built a bio-entity citation network. Song et al. (2013) constructed a gene-citation-gene network of gene pairs implicitly connected through citations and determined that it can be useful for implicitly detecting gene interactions. Hsiao and Chen (2020) used subjects as the analysis unit and constructed a word bibliographic coupling network to depict the development of research subfields and research trends in library and information science. Cheng et al. (2020) selected keywords as a type of knowledge entity and proposed a keyword-citation-keyword network, which has been used for discipline knowledge structure analysis and demonstrates good performance. Zhang et al. (2022) treated keyword pairs in the same papers as nodes and used paper citation relationships as edges to construct a keyword pair-based citation relationship.

In summary, the analysis units of the citation network have been successfully extended from articles to keywords or entities, and the fine-grained citation network has proved effective in elucidating the field knowledge structure. In this research, we further extend the current studies on the keyword citation network by distinguishing the semantic roles of keywords in such a network and propose a new network, namely, the term function–aware keyword citation network, to conduct a science mapping analysis.

2.3. Term function identification

Term function refers to the specific semantic role that a word, term, or phrase plays in scientific texts (Lu et al., 2019). Some scholars have realized the significance of distinguishing the semantic functions of words in scientific documents and have adopted various methods to realize this task. In current research, term function identification is approached in three primary ways. The first is manual annotation, which usually applies to small-scale datasets. For example, Lu et al. (2019) manually annotated the term function of author keywords from the Journal of Informetrics, including ‘research topic’, ‘research method’, ‘research object’, ‘research area’, ‘data’, and ‘others’. Ma and Lund (2021) also manually coded the research topics and methods from 3422 articles to study the evolution and shift of research topics and methods in the library and information science field. The second method is discriminant classification based on a supervised learning algorithm. For example, Heffernan and Teufel (2018) constructed a supervised learning-based classifier to discriminate between problem-approach possibilities for a given phrase. Luan et al. (2019) proposed an information extraction system called DyGIE, which identifies entities, such as tasks and methods, from input text using sequence labeling. In recent years, deep learning methods have also been applied to this task. Li, Liu, Cheng and Lu (2021) proposed a distant supervision-based approach to automatically identify dataset entities from large-scale literature and achieved good performance. Färber et al. (2021) used various deep learning models to extract methods and datasets in scientific publications. Yao, Ye, Zhang, Li and Wu (2023) compared different methods to select a high-performance Named Entity Recognition model for extracting methods, datasets, and metrics from large-scale AI literature. The third method is text generation, which generates new data similar to training samples by learning corpus features. For example, Li, Lu and Cheng (2022) employed a title-generation strategy to automatically obtain problem and method information from given references.

To sum up, various approaches have been used to identify functional terms in the scientific literature. However, these methods have their own characteristics and are suitable for different scenarios. Manual annotation is costly and thus difficult to perform on large-scale domain datasets. The performance of supervised machine learning models usually relies on large-scale, high-quality training corpora. Generative strategies, by contrast, can save considerable data annotation effort. In this study, we compared the performance of the classification method with the text generation method and selected a high-performance model for extracting research questions and methods from large-scale literature.

3. Method

Fig. 1 presents the overall research design of this study. First, we collected data from the ACM digital library. Then, we extracted the title, keyword, and citation information from the collected dataset, and constructed the dataset for term function identification. Next, we carried out experiments on term function identification and identified the best method. Subsequently, a question-method term citation network was constructed based on the identified question and method terms and citation links between them. Finally, the science mapping analysis based on the generated network was carried out through network structure analysis and visualization, question-method bipartite network analysis, and knowledge community analysis.

Fig. 1. Overview of the method.

3.1. Data

A large-scale corpus of scientific literature in a specific field is necessary for our study. First, we introduce the original dataset we collected. Next, we present the construction of the annotation dataset for term function identification models.

3.1.1. Data collection

The data used in this study are conference proceedings from the ACM digital library. The ACM digital library is relatively complete and openly accessible, and it provides a comprehensive bibliographic database focused exclusively on the field of computing. Thus, we collected the full records of all the conference proceedings from the ACM digital library for the experiments. This dataset is also consistent with those used in previous studies (Lu et al., 2021; Luo, Lu, He & Wang, 2022; Yang, Lu, Hu & Huang, 2022), which supports its validity.

The collected dataset consists of 299,567 conference proceedings from 1951 to 2018. Fig. 2 shows the distribution of ACM conference proceedings. The number of papers increases gradually over the years. Next, we extracted fields such as title, keywords, abstract, and references, and stored them in a local MySQL database. After extracting the citation links, 156,805 articles with 467,050 citation links were obtained from this dataset for the follow-up experiments. The empirical analysis of this dataset will help researchers gain a more comprehensive and in-depth understanding of the development of knowledge in this field.

Fig. 2. Distribution of ACM conference proceedings.

3.1.2. Data for term function identification

To avoid the high cost of manual annotation, this paper adopts a data annotation method based on the rules of papers’ titles (Kondo, Nanba, Takezawa & Okumura, 2009). In databases such as the Association for Computational Linguistics (ACL) Anthology and ACM, numerous titles have the style ‘A based on B’, ‘A using B’, and ‘A for B’. These specific titles can be largely regarded as an indication of the research questions and methods of the papers. Fig. 3 shows an example of a paper with a specific title style. The title style of this example is ‘A question based on B method’, indicating that the research question and method of this paper are ‘paper similarity’ and ‘latent Dirichlet allocation’, respectively. Moreover, these functional terms or their synonyms also appear in the abstract and keywords. Therefore, we can utilize the title, abstract and keywords of articles with such titles to construct the training corpus without any manual annotation effort.

Fig. 3. An example of a paper with a specific title style.

The specific process is as follows. First, we obtained 130,472 articles with the title style ‘A based on B’, ‘A using B’, and ‘A for B’ from the original dataset through template matching. Secondly, we used the Stanford NLP tools (Manning et al., 2014) to perform word segmentation, part-of-speech tagging and entity extraction on papers’ titles. Subsequently, we used the Knuth-Morris-Pratt algorithm (Knuth, Morris & Pratt, 1977) to extract research question terms and research method terms from titles. Then we calculated the similarity between keywords and extracted terms, and keywords below the similarity threshold were excluded. In this way, we obtained the research question and research method of each article with the specific title style. To evaluate the reliability of this method, we randomly selected 1000 articles for manual annotation; the precision reached 95%, which shows that our annotations are sufficiently reliable.
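The template-matching step can be illustrated with a minimal sketch. It is not the authors' implementation: whitespace tokenization stands in for the Stanford NLP tools, the function names are hypothetical, and the assignment of question and method roles to each side of a connector (in particular for ‘A for B’) is an assumption made for illustration. The Knuth-Morris-Pratt search locates the connector phrase in the token sequence, and the title is then split around it.

```python
from typing import Dict, List, Optional

# Connector phrases for the three title templates named in the text.
# The role assigned to the left and right side of each connector is an
# assumption of this sketch, not something stated in the paper.
TEMPLATES = [
    (["based", "on"], "question", "method"),
    (["using"], "question", "method"),
    (["for"], "method", "question"),
]


def failure_table(pattern: List[str]) -> List[int]:
    """KMP prefix function: length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(tokens: List[str], pattern: List[str]) -> int:
    """Return the index of the first occurrence of pattern in tokens, or -1."""
    if not pattern:
        return 0
    table = failure_table(pattern)
    k = 0
    for i, tok in enumerate(tokens):
        while k > 0 and tok != pattern[k]:
            k = table[k - 1]
        if tok == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


def split_title(title: str) -> Optional[Dict[str, str]]:
    """Split a title around the first matching connector phrase."""
    tokens = title.lower().split()
    for connector, left_role, right_role in TEMPLATES:
        pos = kmp_find(tokens, connector)
        # Require non-empty text on both sides of the connector.
        if 0 < pos < len(tokens) - len(connector):
            left = " ".join(tokens[:pos])
            right = " ".join(tokens[pos + len(connector):])
            return {left_role: left, right_role: right}
    return None


print(split_title("Paper similarity based on latent Dirichlet allocation"))
```

A full pipeline would additionally filter author keywords by their similarity to the extracted terms, as described above; the similarity measure and threshold are not specified in the text and are therefore omitted here.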

3.2. Term function identification method

This paper compared three approaches, namely the BERT model, the BERT-BiLSTM model, and the title-generation model, to identify the best one for extracting research questions and methods from scientific literature. Details of each method are as follows.

  • (1) BERT and BERT-BiLSTM models

BERT (Devlin, Chang, Lee & Toutanova, 2018) is a transformer-based pre-training vector representation method developed by Google that performs well on many tasks including scientific entity extraction. Bi-directional Long Short-Term Memory (BiLSTM) is also a frequently adopted model due to its good feature representation capability in many recent studies (Yao, Ye, Zhang, Li & Wu, 2023). Therefore, we applied the BERT model to obtain the embedding of the input text and fine-tune it for term function classification. In addition, we explored the combination of BiLSTM and BERT to obtain more contextual features. The structure of the models is shown in Fig. 4.

Fig. 4. Structure of the BERT and BERT-BiLSTM models.

Firstly, the abstract of each paper S_abs = {W_1, W_2, W_3, …, W_n} and its corresponding keywords {Keyword_1, …, Keyword_m} are concatenated separately as the input text. The corresponding category (research question or research method) is used as the classification label. Secondly, we use BERT to map text sequences to multidimensional spatial vectors, and the final input representation is the sum of the corresponding token, segment, and position embeddings. Subsequently, the embedding layer output is fed to BiLSTM to capture contextual information (for the BERT-BiLSTM model). Finally, the Softmax function is used to calculate the probability distribution of the dense layer output, and the classification results of the term function are obtained.
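As a rough illustration of the first and last steps (input construction and the Softmax output), the following sketch uses plain Python and omits the BERT encoder itself. The separator string, the keyword-first ordering, and all function names are assumptions made for illustration; the paper does not specify them.

```python
import math
from typing import Dict, List, Tuple

LABELS = ("research question", "research method")


def build_instances(abstract: str,
                    keyword_labels: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each keyword with the abstract: one (text, label) instance per keyword."""
    instances = []
    for keyword, label in keyword_labels.items():
        # "[SEP]" mirrors BERT's sentence-pair separator; placing the
        # keyword first is an assumption of this sketch.
        instances.append((f"{keyword} [SEP] {abstract}", label))
    return instances


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the dense-layer outputs."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: List[float]) -> str:
    """Return the label with the highest softmax probability."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]


# Hypothetical example: one abstract with two labeled keywords.
examples = build_instances(
    "We measure paper similarity with a topic model.",
    {"paper similarity": "research question",
     "latent dirichlet allocation": "research method"},
)
print(examples[0])
print(predict([2.0, 0.5]))
```

In practice the logits would come from the fine-tuned model's dense layer; here they are supplied by hand only to show how the label is selected.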

  • (2) Title-generation model

Title generation techniques have been widely used in various scenarios, such as e-commerce products and academic text (Li et al., 2022; Miao, Cao, Li & Guan, 2020). In this paper, the title-generation strategy aims to output a style-specific title containing the question and method information for each given scientific paper. For the construction of the title generator, we use the sequence to sequence (seq2seq) framework (Rush, Chopra & Weston, 2015) and copy-attention technique (See, Liu & Manning, 2017). Fig. 5 presents the structure of the proposed title-generation model.

Fig. 5. Structure of the title-generation model.

The input is the preprocessed academic text sequence, where the abstract of each paper is represented as {S_1, S_2, S_3, …, S_n}, and its corresponding title is {A based on B}. Firstly, the input text is converted to a vector representation using the Word2vec technique. The encoder layer, composed of a BiLSTM, then reads the input feature vectors to capture their latent semantic information and realize semantic encoding. Secondly, attention is calculated through the attention layer, and the attention distribution is used to produce a weighted sum of the encoder hidden states, namely the context vector. Subsequently, in the decoder layer, the generation probability is calculated from the context vector to determine whether the output of the current time step adopts the copy mechanism, that is, copying words from the original input text. Next, the generated vectors are fed to a fully connected layer and the Softmax function to produce the vocabulary distribution. Finally, we obtain the output titles with the specific style (A based on B).
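The role of the generation probability can be made concrete with a small sketch of the copy mechanism as formulated by See, Liu and Manning (2017): the final word distribution mixes the vocabulary distribution, weighted by p_gen, with the attention weights over the source tokens, weighted by 1 − p_gen. The function name and the toy numbers are hypothetical; this shows only the mixing step, not the trained model.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(p_gen: float,
                       vocab_dist: Dict[str, float],
                       attention: List[float],
                       source_tokens: List[str]) -> Dict[str, float]:
    """Mix generating from the vocabulary with copying from the source text."""
    final: Dict[str, float] = defaultdict(float)
    # Generation part: probability mass from the decoder's vocabulary softmax.
    for word, prob in vocab_dist.items():
        final[word] += p_gen * prob
    # Copy part: attention mass is added to whichever source token it points at,
    # so out-of-vocabulary source words can still be produced.
    for weight, token in zip(attention, source_tokens):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Toy example with made-up numbers: "lda" is not in the vocabulary
# but can still be emitted by copying it from the source text.
dist = final_distribution(
    p_gen=0.5,
    vocab_dist={"based": 0.6, "on": 0.4},
    attention=[0.7, 0.3],
    source_tokens=["lda", "based"],
)
print(dist)
```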

3.3. Network construction method

In this study, we consider a term function–aware keyword citation network as a kind of knowledge network composed of functional keywords from articles and citation relationships between articles. As mentioned, we focus on two types of term functions: research questions and research methods. Thus, a term function–aware keyword citation network is formally expressed as G_qm = {V_q, V_m, E}. This network contains two types of nodes, where V_q is the set of research question nodes, V_m denotes the set of research method nodes, and E represents the set of edges in the network.

For the semantic relationship between any two nodes with citation links, there are mainly two types of cases. One case is that the various combinations of research questions and research methods reflect a specific meaning. In this case, four types of elements in the set of edges exist: e_qq, e_qm, e_mq and e_mm, representing ‘research question-research question’, ‘research question-research method’, ‘research method-research question’, and ‘research method-research method’, respectively. Specifically, e_qq indicates that the research topic of the citing paper is the intersection of the two research questions, or the subdivision of a research question. Both e_qm and e_mq can be understood as using a certain research method to solve a research question. And e_mm usually means that there is a correlation between two research methods, such as combination and comparison. In this way, we can better illustrate the knowledge structure of a specific domain. Another case is to directly define the relationship between keywords based on their context. This method can clarify the specific semantic relationship between keywords, such as background, usage, extension, contrast, base, etc. However, this method requires extracting textual information between two functional terms and then determining the semantic relationship between them from the context. Given the limitations of the dataset, and because the purpose of this study is to extend the keyword citation network by introducing the term function, we focus on the first case.

Fig. 6 presents an example of a question-method keyword citation network. Based on the identified question and method terms, the functional terms of a citing paper and its cited paper can be connected through citation links. The question-method keyword citation network can be built by constructing all term citation pairs and calculating their frequencies. Citation links between the same terms with identical functions are excluded. Each edge has a weight w, representing the citation frequency between these two nodes. Moreover, because the citation between terms has direction, the term function–aware keyword citation network constitutes a directed network.

Fig. 6. Example of a question-method keyword citation network.

According to the definition of the question-method keyword citation network, its specific construction processes can be described as follows:

  • 1) Initializing the question-method keyword citation network G_qm = {V_q, V_m, E}, where V_q, V_m, and E are empty, and P represents the collection of papers;

  • 2) Extracting the research questions and research methods from each paper P_i, assigning a unique number to each term and counting its frequency;

  • 3) Extracting all citation links from the collection P;

  • 4) Constructing the citation pairs between the question and method terms according to the ID of any two papers with citation links and the question and method terms of the corresponding papers;

  • 5) Assigning a unique number to each citation pair and counting its frequency and obtaining the edge set E; and

  • 6) Outputting the term function–aware keyword citation network G_qm.
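The construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the input layout (a dictionary of papers with `questions` and `methods` lists, plus a list of citing/cited ID pairs) and all names are hypothetical, nodes are keyed by (term, function) tuples instead of numeric IDs, and edges are assumed to point from the citing paper's term to the cited paper's term.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, List, Tuple

Node = Tuple[str, str]  # (term, function), where function is "q" or "m"


def paper_nodes(paper: Dict[str, List[str]]) -> List[Node]:
    """Functional terms of one paper as (term, function) nodes."""
    return ([(t, "q") for t in paper.get("questions", [])] +
            [(t, "m") for t in paper.get("methods", [])])


def build_network(papers: Dict[str, Dict[str, List[str]]],
                  citations: Iterable[Tuple[str, str]]):
    """Return (V_q, V_m, E), where E maps (source, target) to its citation frequency."""
    # Step 2: collect functional terms and count their frequencies.
    term_freq: Counter = Counter()
    for paper in papers.values():
        term_freq.update(paper_nodes(paper))

    # Steps 3-5: turn each paper-level citation into term citation pairs.
    edges: Counter = Counter()
    for citing, cited in citations:
        if citing not in papers or cited not in papers:
            continue
        for src, dst in product(paper_nodes(papers[citing]),
                                paper_nodes(papers[cited])):
            # Links between the same term with the identical function are excluded.
            if src == dst:
                continue
            edges[(src, dst)] += 1

    v_q = {node for node in term_freq if node[1] == "q"}
    v_m = {node for node in term_freq if node[1] == "m"}
    return v_q, v_m, edges


# Hypothetical toy data.
papers = {
    "p1": {"questions": ["image retrieval"], "methods": ["deep learning"]},
    "p2": {"questions": ["deep learning"], "methods": ["gradient descent"]},
    "p3": {"questions": ["image retrieval"], "methods": ["hashing"]},
}
citations = [("p1", "p2"), ("p3", "p1")]
v_q, v_m, edges = build_network(papers, citations)
print(len(v_q), len(v_m), len(edges))
```

Note that ‘deep learning’ appears as two distinct nodes in this toy example, once as a method and once as a question, which is exactly the distinction the term function–aware network is meant to preserve.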

4. Experiments and results

This section performs term function identification experiments, network structure analysis and visualization, question-method bipartite network analysis, and knowledge community analysis based on the ACM dataset.

4.1. Term function identification experiments

For the BERT and BERT-BiLSTM models, the training corpus was divided into a training set, a validation set and a test set in a ratio of 8:1:1. We used the pretrained BERT-Base model developed by Google for embedding and fine-tuning, which consists of 12 transformer blocks with a hidden size of 768. The hidden dimension of BiLSTM was 200, and the batch size was 32. For the title-generation model, we randomly selected 2000 articles as the test set and used the remaining articles as the training set for model fitting. The maximum sentence length was 400, the batch size was 32, and the word embedding dimension was 300.

Table 1 shows the results of the term function identification experiments. We used precision, recall, and F1 value to evaluate the proposed models. For the title-generation model, we invited three Information Science Master's students to manually compare the generated titles with the original titles, repeated the experiment three times, and took the average value. The results in Table 1 illustrate that the BERT model outperforms the other two models. A major reason for this may be that generating new data instances requires a deep understanding of text semantics, which is still a challenging task. The title-generation model also has certain requirements for the dataset, and its performance is often better on highly structured data. The BERT model and the BERT-BiLSTM model, in contrast, only need to discriminate between different types of data instances. The training corpus constructed in this paper is of high quality and sufficient quantity for the feature learning of the BERT and BERT-BiLSTM models. Thus, they achieve better performance. However, a major advantage of the title-generation model is that it can directly use many titles with specific styles as training labels, which saves annotation costs and gives it strong generalization ability.

Table 1. Results of term function identification.

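| Method | Category | Precision | Recall | F1 |
|---|---|---|---|---|
| Title-generation model | Research question | 0.72 | 0.75 | 0.73 |
| Title-generation model | Research method | 0.76 | 0.73 | 0.74 |
| BERT model | Research question | 0.91 | 0.88 | 0.90 |
| BERT model | Research method | 0.89 | 0.91 | 0.90 |
| BERT-BiLSTM model | Research question | 0.90 | 0.88 | 0.89 |
| BERT-BiLSTM model | Research method | 0.88 | 0.90 | 0.89 |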
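For reference, the F1 value in Table 1 is the harmonic mean of precision and recall. The short sketch below recomputes it for the research question row of the title-generation model; the function name `f1_score` is illustrative. Because the table reports rounded precision and recall, recomputing F1 from those rounded inputs can differ from a reported value in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Title-generation model, research question row of Table 1
print(round(f1_score(0.72, 0.75), 2))  # 0.73
```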

Next, the trained BERT-based term function identification model was applied to the papers without specific title styles in the original dataset. To ensure the reliability of the extraction, we randomly selected 2000 of these articles for manual annotation and compared the annotations with the model outputs. The results showed that the BERT model performs well on the remaining data, with over 85% precision for both the research question and the research method. In this way, we obtained the research question and research method of each paper in the ACM dataset.

4.2. Network structure analysis and visualization

Based on the network construction method presented in Section 3.3, the question-method term citation network Gqm = {Vq, Vm, E} was constructed. After normalization, the research question set Vq contains 109,865 distinct terms, the research method set Vm contains 112,098 distinct terms, and the edge set E contains 1,706,512 edges. Next, we perform science mapping analysis using the generated question-method term citation network.

Network structure analysis aims to reveal the characteristics of the constructed network by depicting its topology. To analyze the network topology, we selected some widely used metrics for complex networks: average degree, average weighted degree, network diameter, density, average path length, and average clustering coefficient (Lu et al., 2020; Zhao et al., 2018). These metrics represent different perspectives of the constructed network. The constructed question-method term citation network contains many edges with low weights, which are relatively random term citation pairs and could interfere with the analysis. After examining the term citation pairs, we therefore set a threshold so that each retained edge has a minimum weight of three. Then, the topology metrics of the filtered network were calculated, and the results are presented in Table 2.
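The pruning step can be sketched as follows. This is a minimal illustration on toy data, assuming the network is stored as a mapping from directed term pairs to citation counts; the helper names and the example edges are hypothetical and are not taken from the ACM dataset.

```python
def filter_by_weight(edges, min_weight=3):
    """Keep directed edges whose weight is at least min_weight.

    edges maps (citing_term, cited_term) -> citation count.
    """
    return {pair: w for pair, w in edges.items() if w >= min_weight}


def nodes_of(edges):
    """Nodes that still have at least one incident edge."""
    return {term for pair in edges for term in pair}


# Toy data for illustration only
toy_edges = {
    ("recommender system", "collaborative filtering"): 5,
    ("information retrieval", "language model"): 3,
    ("web search", "wikipedia"): 1,
}
kept = filter_by_weight(toy_edges)
print(len(kept), len(nodes_of(kept)))
```

In this sketch, nodes left without any edge after filtering drop out of the network, which is one plausible reading of how the node count in Table 2 was obtained.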

Table 2. Topology metrics of the question-method term citation network.

| Metrics | Question-method term citation network | Keyword-citation-keyword network |
|---|---|---|
| Number of nodes | 9983 | 12,331 |
| Number of edges | 12,904 | 16,640 |
| Average degree | 1.293 | 1.349 |
| Average weighted degree | 5.036 | 5.327 |
| Network diameter | 16 | 15 |
| Density | 0.0001 | 0.0001 |
| Average path length | 6.207 | 5.573 |
| Average clustering coefficient | 0.079 | 0.08 |

From Table 2, it is evident that the question-method term citation network contains 9983 nodes and 12,904 edges. The network diameter is 16, and the density is 0.0001, indicating that this network is relatively sparse. The average degree of this network is 1.293, indicating that each research question or research method only has one citation link with other terms on average. The average weighted degree is 5.036, revealing that each question or method term has an average of five citations with other terms. This result indicates that the strength of citation association between keywords is relatively weak in this network. The average path length of this network is high, and the average clustering coefficient is low, indicating that it is not a typical ‘small-world network’.
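The average degree and density in Table 2 can be checked from the node and edge counts alone. The sketch below assumes a directed-graph convention in which the average degree is the number of edges divided by the number of nodes and the density is the share of the N(N - 1) possible directed edges that are present; the reported values are consistent with this convention, although the paper does not state it explicitly.

```python
def average_degree(n_nodes, n_edges):
    """Average degree of a directed graph under the E / N convention."""
    return n_edges / n_nodes


def directed_density(n_nodes, n_edges):
    """Share of the N * (N - 1) possible directed edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1))


# Node and edge counts reported in Table 2
qm_nodes, qm_edges = 9983, 12904
print(round(average_degree(qm_nodes, qm_edges), 3))    # 1.293
print(round(directed_density(qm_nodes, qm_edges), 4))  # 0.0001
```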

To compare the question-method term citation network with the traditional keyword citation network, we constructed a keyword-citation-keyword (KCK) network (Cheng et al., 2020) as the control group under the same conditions. The statistical results are shown in Table 2. The KCK network has more nodes and more edges than the question-method term citation network. Given that the unfiltered question-method term citation network and KCK network contain 1,706,512 and 1,683,106 edges, respectively, more weak associations are filtered out of the pruned question-method term citation network, and thus the validity of knowledge association is improved. This is because when a keyword citation pair is extended to multiple types, the association strength between them is dispersed. The larger network diameter and longer average path length of the question-method term citation network show that the association between terms in this network requires a longer path.

Next, we used Gephi to visualize the question-method term citation network. We retained only edges with a weight greater than 16 to keep the critical links and achieve a clearer visualization. The result is presented in Fig. 7, which contains 36 research questions, 22 research methods, and 47 edges. The node size is proportional to its degree; red and green nodes represent research questions and research methods, respectively. Each arrow points from a citing term to a cited term, indicating the citation relationship between the two terms. In Fig. 7, the core research question nodes include recommender system, information retrieval, wireless sensor network, and collaborative filtering, and the core research method nodes include mobile device, transactional memory, Wikipedia, and Twitter. Overall, the research question nodes are larger than the research method nodes, indicating that researchers pay more attention to research questions than to research methods in the computing field. Note that some terms appearing in the method set, such as Wikipedia, Twitter, and Benchmark, may not be research methods in the strict sense. Because this paper focuses on extracting research questions and methods from the scientific literature and requires data integrity, we have retained these terms to represent the research objects, data sources, or platforms of the corresponding research questions.

Fig. 7. Question-method term citation network (edge weight >16).

4.3. Question-method bipartite network analysis

As mentioned, four types of node associations exist in the question-method term citation network. Among them, question terms citing method terms and method terms citing question terms are important patterns because scientific research is often described as a problem-solving activity. To reveal the patterns and characteristics of the problem-solving activity, we selected and merged these two node-link types. After merging, the direction between question and method terms carries little meaning, so the network can be transformed into an undirected network, namely, a question-method bipartite network. Specifically, we extracted the edges between research questions and research methods (eqm and emq) and merged the edges that connect the same nodes in opposite directions. As the question-method bipartite network is undirected, a line without an arrow represents each citation link.
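The merging step can be sketched as follows. The representation of nodes as (term, role) pairs and the choice to sum the weights of the two directions are assumptions made for illustration; the paper states that edges in opposite directions are merged but does not specify how their weights are combined.

```python
from collections import defaultdict


def to_bipartite(directed_edges):
    """Merge question->method and method->question edges into undirected ones.

    directed_edges maps ((term, role), (term, role)) -> weight, where role is
    "q" for a research question and "m" for a research method.
    """
    merged = defaultdict(int)
    for (source, target), weight in directed_edges.items():
        if source[1] == target[1]:
            continue  # skip question-question and method-method links
        question, method = (source, target) if source[1] == "q" else (target, source)
        merged[(question, method)] += weight  # assumed: weights are summed
    return dict(merged)


# Toy data for illustration only
toy = {
    (("information retrieval", "q"), ("language model", "m")): 4,
    (("language model", "m"), ("information retrieval", "q")): 2,
    (("information retrieval", "q"), ("web search", "q")): 7,
}
print(to_bipartite(toy))
```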

The constructed question-method bipartite network consists of 9923 nodes and 6679 edges. Individual network indicators for terms reflect their position in the network structure and their ability to influence and control other terms. We extracted the maximal connected subgraph of the question-method bipartite network and calculated the degree, weighted degree, and betweenness centrality of the terms. As shown in Table 3, Twitter, Wikipedia, Mobile phone, MapReduce, and Machine learning appear in the degree, weighted degree, and betweenness centrality rankings, indicating that these methods are most central to the structure and have been widely used to resolve the research problems in this field. Collaborative filtering and Symbolic execution appear in the weighted degree ranking but not in the degree ranking, indicating that they have a strong influence on a smaller number of research questions. Note that all the methods have a low betweenness centrality, which indicates that the correlation between terms is more likely to be direct than indirect through a “bridge”.

Table 3. Top-10 research methods in terms of degree, weighted degree, and betweenness centrality.

| Ranking | Methods | Degree | Methods | Weighted degree | Methods | Betweenness |
|---|---|---|---|---|---|---|
| 1 | Twitter | 120 | Twitter | 544 | MapReduce | 0.1814 |
| 2 | Wikipedia | 71 | Wikipedia | 335 | Machine learning | 0.1079 |
| 3 | Mobile phone | 61 | Transactional memory | 307 | Mobile phone | 0.0697 |
| 4 | Transactional memory | 56 | Mobile phone | 280 | Twitter | 0.0667 |
| 5 | Mechanical Turk | 53 | Mechanical Turk | 252 | Benchmark | 0.0586 |
| 6 | MapReduce | 43 | MapReduce | 204 | Mobile device | 0.0581 |
| 7 | Social media | 35 | Collaborative filtering | 172 | Neural network | 0.0533 |
| 8 | Machine learning | 32 | Symbolic execution | 162 | Transactional memory | 0.0532 |
| 9 | Smartphone | 30 | Machine learning | 151 | Wikipedia | 0.0505 |
| 10 | Social network | 29 | Social media | 138 | Design pattern | 0.0503 |
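The three indicators in Table 3 can be computed as in the following sketch, which extracts the largest connected component and then derives the degree, weighted degree, and betweenness centrality of each node. It is a pure-Python illustration on toy data: the betweenness routine follows Brandes' algorithm on unweighted shortest paths and normalizes by the number of node pairs, which is an assumption about how the reported values were scaled. All function and node names are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """edges maps (u, v) -> weight for an undirected graph."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    return adj


def largest_component(adj):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def betweenness(adj, nodes):
    """Normalized betweenness on unweighted shortest paths (Brandes)."""
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        stack, preds = [], defaultdict(list)
        sigma = dict.fromkeys(nodes, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(nodes, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(nodes)
    scale = (n - 1) * (n - 2) if n > 2 else 1  # each unordered pair is counted twice
    return {v: b / scale for v, b in score.items()}


# Toy bipartite edges for illustration only
toy_edges = {
    ("q1", "twitter"): 3,
    ("q2", "twitter"): 1,
    ("q3", "twitter"): 2,
    ("q3", "wikipedia"): 4,
    ("q9", "svm"): 1,  # separate small component, dropped below
}
adj = build_adjacency(toy_edges)
core = largest_component(adj)
degree = {v: len(adj[v]) for v in core}
weighted_degree = {v: sum(adj[v].values()) for v in core}
centrality = betweenness(adj, core)
print(degree["twitter"], weighted_degree["twitter"], round(centrality["twitter"], 3))
```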

To obtain a clearer visualization while retaining important nodes, we kept only nodes with a degree higher than 11. The mapping result of the question-method bipartite network is presented in Fig. 8, where red and blue nodes represent research questions and research methods, respectively. This question-method bipartite network consists of 59 nodes and 135 edges, in which the question term set contains 34 nodes and the method term set contains 25 nodes. In Fig. 8, the node size is proportional to its degree, reflecting how many research methods are used to solve a research question or how many research questions a research method is applied to. Fig. 8 reveals that the larger nodes are primarily distributed in the central area of the network; these are relatively important research questions and research methods in this field, such as information retrieval, recommender system, mobile device, collaborative filtering, machine learning, deep learning, and implicit feedback.

Fig. 8. Question-method bipartite network (node degree >11).

Next, we use an ego network to analyze the main association objects of specific research questions or research methods. In this way, we can clearly observe the solutions to a specific research question and the typical application scenarios of a specific method, which further enhances the understanding of knowledge associations in the computing field and provides guidance for the selection of research questions and research methods. We took the research question information retrieval and the research method machine learning as examples to construct the corresponding ego networks. Specifically, we extracted all nodes related to these two nodes from the overall network and retained the edges with a weight greater than 1. The resulting ego networks of information retrieval and machine learning are illustrated in Figs. 9 and 10, respectively. The thickness of an edge is proportional to the strength of the association between two nodes. As this study focuses on the links between research questions and research methods, the links between nodes with the same function are not considered in the ego network.
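The ego network extraction can be sketched as follows, assuming the bipartite network is stored as a mapping from (question, method) pairs to edge weights. The function name, the `role` argument, and the toy weights are hypothetical; only the rule of keeping edges with a weight greater than 1 comes from the text.

```python
def ego_network(bipartite_edges, ego, role, min_weight=2):
    """Return {neighbour: weight} for edges of ego with weight >= min_weight.

    bipartite_edges maps (question, method) -> weight; role is "question"
    or "method" and selects which side of the pair the ego term sits on,
    since the same term may occur in both roles.
    """
    idx = 0 if role == "question" else 1
    neighbours = {}
    for pair, weight in bipartite_edges.items():
        if pair[idx] == ego and weight >= min_weight:
            neighbours[pair[1 - idx]] = weight
    return neighbours


# Toy weights for illustration only
toy_bipartite = {
    ("information retrieval", "language model"): 9,
    ("information retrieval", "relevance feedback"): 6,
    ("information retrieval", "word embedding"): 1,
    ("classification", "machine learning"): 5,
    ("information retrieval", "machine learning"): 3,
}
ir_ego = ego_network(toy_bipartite, "information retrieval", "question")
print(len(ir_ego), sum(ir_ego.values()))  # edge count and total weight
```

The edge count and total weight printed at the end correspond to the two summary figures reported for each ego network in the text.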

Fig. 9. Ego network of information retrieval.

Fig. 10. Ego network of machine learning.

As depicted in Fig. 9, the ego network of information retrieval consists of 96 edges with a total weight of 518. The node with the highest association intensity with information retrieval is language model, followed by relevance feedback, clickthrough data, probabilistic model, and machine learning, which are widely used research methods in the study of information retrieval. In addition, some nodes have a lower association intensity, such as latent semantic model, word embedding, and convolutional neural network, reflecting that the researchers use various methods to explore and study information retrieval questions. This also provides us with many reference methods when conducting information retrieval research.

Fig. 10 reveals that the ego network of machine learning consists of 27 edges with a total weight of 140. The nodes with the highest association strength involve classification and text categorization, indicating that machine learning is a frequently used research method in classification problems. Simultaneously, the method of machine learning is also widely used in research on SVM optimization, information retrieval, data mining, and collaborative filtering. In addition, various nodes have relatively weak associations, such as question answering, spam detection, and aggregated search, revealing that machine learning has become a general method in the computing field. This inspires the researcher to explore the application of machine learning methods in more research questions.

4.4. Knowledge community analysis

Community structure has been proven to exist in many networks based on actual data (Girvan & Newman, 2002), reflecting the field knowledge structure. A community is a local subnetwork in the overall network. Correlation is strong between members within the same community and weak between members from different communities. The relevance and clustering of knowledge in a specific field can be depicted well by identifying and analysing communities. As a special knowledge network, the term function–aware keyword citation network also exhibits aggregation of terms. Thus, one of the most popular algorithms for uncovering community structure, the Louvain algorithm (Blondel, Guillaume, Lambiotte & Lefebvre, 2008), was employed to divide the constructed question-method term citation network into clusters. The structure of the knowledge community and the composition and association of the question and method terms in each community can be revealed using this algorithm.

High-frequency words are usually identified to map the network because they represent the main topics of textual content. In this study, the network was pruned by node degree, and the threshold was set to 24 to retain the core terms for analysis. The simplified network contains 97 nodes and 390 edges. Gephi was used to detect the communities of the created network, and the spatial layout of nodes and edges was determined using the ForceAtlas2 algorithm (Jacomy, Venturini, Heymann, & Bastian, 2014). For rendering, the nodes were assigned different colours based on modularity classes; node labels ending with 0 and 1 represent research questions and research methods, respectively. Thus, nodes in the same colour have similar research topics, whereas nodes in different colours have distinct research topics.

The question-method term citation network is divided into seven communities, representing seven main directions of the computing field. Fig. 11 presents the subjects and correlation structures of these communities, and the representative question and method terms of each community are listed in Table 4. The subject of each cluster is labelled according to the representative terms contained in each community. The modularity is 0.64, indicating the good performance of community detection (Newman, 2004).
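The modularity score quantifies how much denser the connections inside communities are than would be expected at random. The sketch below computes Newman's modularity for a given partition of an undirected weighted graph without self-loops; it is a simplified illustration rather than the Gephi implementation used in this study, and the toy graph and names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q for an undirected weighted graph.

    edges maps (u, v) -> weight with each undirected edge listed once;
    community_of maps node -> community id.
    """
    total = sum(edges.values())    # m, the total edge weight
    internal = defaultdict(float)  # edge weight inside each community
    strength = defaultdict(float)  # summed weighted degree per community
    for (u, v), w in edges.items():
        strength[community_of[u]] += w
        strength[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(internal[c] / total - (strength[c] / (2 * total)) ** 2
               for c in strength)


# Toy graph: two triangles joined by a single bridge edge
toy_graph = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(toy_graph, partition), 4))
```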

Fig. 11. Knowledge communities of the question-method term citation network.

Table 4. Knowledge communities of the question-method term citation network.

| Community | Subject | Terms |
|---|---|---|
| 1 | Java program design and application | java0; transactional memory0; transactional memory1; garbage collection0; garbage collection1; android applications0; java program0; multithreaded programs0; race detection1; aspect-oriented program0; aspectj0; object-oriented program0; parallel programming0; points-to analysis0; symbolic execution1; embedded system0; concurrency control1; haskell0; javascript0; chip multiprocessor0 |
| 2 | Information retrieval | information retrieval0; web search0; query expansion0; clickthrough data1; implicit feedback1; learning to rank0; language model1; image search0; search engine0; person web search0; image retrieval0; search result0; question answering0; search result diversification0; sponsor search0; query auto-completion0; information management0; hypertext0; contextual advertising0 |
| 3 | Mobile device | mobile phone1; mobile device0; mobile device1; smartphone1; mobile phone0; smartphone0; crowdsourcing1; mobile applications0; visual impairment0; indoor localization0; blind people0; eye-tracking1; wifi1; augment reality0; interaction0; smartwatch1; shape-changing interface0; public display0; tactile feedback1; virtual environment0 |
| 4 | Recommender system | recommend system0; collaborative filtering0; recommendation0; collaborative filtering1; machine learning1; deep learning1; differential privacy1; topic model1; classification0; differential privacy0; mapreduce1; activity recognition0 |
| 5 | Human computer interaction | children0; human computer interaction0; children1; human computer interaction1; motor impairments1; older adult0; cs10; computer science education0; gamification1; access control0 |
| 6 | Social network and social media | twitter1; wikipedia1; mechanical turk1; social network1; social media1; wikipedia0; information extraction0; viral marketing0; online social network0; crowdfunding0 |
| 7 | Sensor network | wireless sensor network0; sensor network0; mobile ad hoc network0; outlier detection0; wireless mesh network0 |
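Because each label in Table 4 ends with 0 (research question) or 1 (research method), the composition of a community can be tallied directly from its term list. The following sketch does this for community 4, using the terms exactly as listed in the table; the function name `count_roles` is illustrative.

```python
def count_roles(terms):
    """Count question (suffix 0) and method (suffix 1) labels."""
    questions = sum(1 for t in terms if t.endswith("0"))
    methods = sum(1 for t in terms if t.endswith("1"))
    return questions, methods


# Terms of community 4 (recommender system) as listed in Table 4
community_4 = (
    "recommend system0; collaborative filtering0; recommendation0; "
    "collaborative filtering1; machine learning1; deep learning1; "
    "differential privacy1; topic model1; classification0; "
    "differential privacy0; mapreduce1; activity recognition0"
).split("; ")
print(count_roles(community_4))  # (6, 6)
```

The result agrees with the six research question terms and six research method terms stated for the recommender system community in the Discussion.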

In terms of scale, the seven communities are distinct. The larger knowledge communities are C1–Java program design and application, C2–information retrieval, and C3–mobile device. As the largest community, C1 contains 16 research questions and five research methods. This community primarily involves research questions on transactional memory, garbage collection, and Android applications, and research methods on transactional memory, garbage collection, and race detection. In addition, C2 consists of 17 research questions and three research methods. The core of this community is information retrieval and web search. The former focuses on query expansion, learning to rank, image retrieval, question answering, and search result diversification, whereas the latter involves person web search, image search, sponsor search, and search result. Simultaneously, the research methods on clickthrough data, implicit feedback, and language model provide basic technology for studying these questions. C3 focuses on the research methods on mobile phone, mobile device, and smartphone, providing solutions to the research questions on mobile device, mobile phone, and mobile applications.

The smaller knowledge communities are C4–recommender system, C5–human computer interaction, C6–social network and social media, and C7–sensor network. Recommender system and collaborative filtering are the main concerns of the C4 community, and the methods of machine learning, deep learning, and topic model are widely used in such questions. In addition, C5 focuses on the questions and methods related to groups of children and older adults in human computer interaction. Moreover, C6 concerns questions on information extraction, viral marketing, online social network, and crowdfunding based on the methods of social network analysis and social media mining, and platforms such as Twitter, Wikipedia, and Mechanical Turk. Finally, C7 is the smallest community, containing only five research question nodes on wireless sensor network, sensor network, mobile ad hoc network, outlier detection, and wireless mesh network.

5. Discussion

Existing fine-grained citation networks do not distinguish the semantic roles of their nodes. To overcome this limitation, we identified the semantic roles of keywords in articles and proposed a term function–aware keyword citation network. To identify the characteristics and evaluate the performance of the term function–aware keyword citation network, we conducted a science mapping analysis based on the generated question-method term citation network through network structure analysis and visualization, question-method bipartite network analysis, and knowledge community analysis. These findings could assist us in semantically understanding the knowledge association and knowledge structure of the computing field.

First, the term function and knowledge network are organically integrated in this study. Although studies on term function identification and knowledge networks have been ongoing for a long time, they were conducted independently. In this study, the term function is considered in the keyword citation network for the first time, and a term function–aware keyword citation network is proposed. We compared the topology metrics of the constructed question-method term citation network with those of the traditional keyword citation network, finding that the former is relatively sparse, mainly exhibiting a lower average degree, higher average path length, and lower average clustering coefficient. This result arises principally because many terms serve as both research questions and research methods, and when they are treated as different nodes, the association strength between nodes naturally becomes weaker. This also indicates that the question-method term citation network is more capable of depicting the real associations between domain knowledge than the keyword citation network.

Second, the term function–aware keyword citation network could reveal the association patterns of domain knowledge in a fine-grained way. Compared with the keyword citation network (Cheng et al., 2020; Zhang et al., 2022), the semantic role of each keyword is no longer simplified but depends on its actual semantic function in the context. Therefore, the association patterns of domain knowledge are extended to four types. Furthermore, we extract the patterns of eqm and emq and use a question-method bipartite network to depict the associations between research questions and research methods in the computing field, which helps researchers and users grasp the core knowledge structure. We can also select any keyword to observe the nodes associated with it, which is achieved by the ego network in this study. Thus, we can clearly observe the solutions to a specific research question and the typical application scenarios of a specific method, providing a path for question-method recommendation.

Third, the associations among research questions and methods form a stable knowledge community. The subcommunities represent seven significant directions in the computing field. Compared with the results of the keyword-citation-keyword network (Cheng et al., 2020), the semantic function of each term is determined and labelled in this study. Thus, we can clearly understand each community's information and judge each community's internal knowledge communication mechanism. For example, the community of the recommender system contains six research question terms and six research method terms. Recommender system and collaborative filtering are focal question nodes of this community, while collaborative filtering is also a primary research method in recommendation problems. In addition, we can easily find that the methods of machine learning, deep learning, and topic model are widely used in recommender system and recommendation-related questions. However, these specific relationships between terms are difficult to describe through the keyword-citation-keyword network. Thus, from this perspective, the knowledge structure embedded in the term function–aware keyword citation network can effectively reflect the actual association among knowledge units. Combined with the semantic roles of terms, the rigor and accuracy of interpreting knowledge structures are further improved.

Besides the findings above, this study provides several implications for researchers. First, this study enriches and expands the theory and methodology of the scientific knowledge network. The traditional research on the scientific knowledge network primarily regards keywords as a single symbolic expression; however, keywords have different meanings in different articles or contexts. To the best of our knowledge, we are the first to enhance the keyword citation network with the term function, improving its accuracy and semantics. This outcome inspires future research to extend various knowledge networks with the rich semantic information contained in the text.

Second, we explored various methods to extract research questions and methods from the scientific literature, which expands the methodology of term function identification in previous studies. The results illustrated that the BERT model achieves the best performance, and the title-generation model has the advantage of reducing the cost of manual labeling. This inspires future research to integrate the strengths of both approaches into other tasks, such as news headline generation or abstract generation from academic text.

Third, this study provides a new opportunity for science mapping analysis. Owing to a lack of semantic information, traditional science mapping analysis has the limitations of coarse granularity and low interpretability. However, the term function–aware keyword citation network facilitates the science mapping analysis of a specific field in a fine-grained, semantic, and multidimensional way. The structure and correlation of domain knowledge can be depicted through this type of network with some advantages that traditional knowledge networks do not have. This finding may offer important implications for understanding the characteristics and connotations of a specific field and for researchers to grasp the development trends of a field.

Finally, this paper defines two types of semantic relationships between functional terms. We focus on the relationships reflected by the various combinations of research questions and research methods, and carry out a multi-dimensional analysis of the computing field. However, identifying specific semantic relationships between keywords is also an important issue. This inspires future research to identify the semantics of relationships from context. In addition, how to improve the accuracy of the cited terms is also an interesting issue. This will further improve the accuracy and application scenarios of keyword citation networks.

6. Conclusion

In this paper, we introduce the term function to extend the keyword citation network and propose a novel knowledge network called the term function–aware keyword citation network. A term function identification method is designed to identify the research questions and methods from the scientific literature. Then, a question-method term citation network is built for science mapping analysis. The results demonstrate that the proposed term function identification method achieves good performance and that the proposed network is more capable of accurately depicting the associations between knowledge units and revealing the intellectual structure of a specific domain. The findings from this study contribute to the understanding of domain knowledge structures in a fine-grained and semantic way, promoting the development of theory and method in scientific knowledge networks.

However, several limitations exist in this research. First, we only considered the research questions and research methods for the semantic roles of keywords. Although they are the most critical term functions in scientific literature, other term functions exist, such as datasets and indices. Future studies should further explore various term function–aware keyword citation networks. In addition, introducing the term function into a keyword citation network provides more possibilities and dimensions for research on scientific knowledge networks. We will continue to explore the applications of term function–aware keyword citation networks in more scenarios and tasks. Finally, given that the dataset used in this study does not contain full text information, we only identified the research questions and research methods of the cited papers from the abstracts. In the future, we will collect the dataset containing the full text of papers and try to identify functional terms from the citation context. In addition, we will attempt to identify specific semantic associations between terms and construct a multilayer network for in-depth analysis.

CRediT authorship contribution statement

Jiamin Wang: Conceptualization, Methodology, Data curation, Formal analysis, Visualization, Funding acquisition, Writing – original draft, Writing – review & editing. Qikai Cheng: Methodology, Validation, Writing – original draft, Writing – review & editing, Funding acquisition. Wei Lu: Conceptualization, Writing – review & editing, Supervision. Yongxiang Dou: Validation, Writing – review & editing, Funding acquisition. Pengcheng Li: Methodology, Writing – review & editing.

Acknowledgments

This work was partially supported by Humanities and Social Science Foundation of the Ministry of Education of China (No. 22YJC870015), National Social Science Foundation of China (No. 18BTQ089), and National Natural Science Foundation of China (No. 72174157).

Data availability

  • Data will be made available on request.
