Bulletin of the Technical Committee on Learning Technology (ISSN: 2306-0212)
Authors:
Yu Bai and Xinyang Li
Abstract:
In film and television education, shot scale identification is a key step in developing students’ visual literacy and narrative understanding. In current teaching practice, however, shot scale recognition relies mainly on professors’ manual labeling in advance, which suffers from low automation, high subjectivity, and insufficient efficiency, limiting the depth and breadth of teaching and research. Based on the MovieShots Dataset, this study designed and developed an automatic shot scale recognition model using deep learning. The model uses YOLOv5 to automatically classify Long Shot (LS), Full Shot (FS), Medium Shot (MS), Close-Up (CU), and Extreme Close-Up (ECU) footage, and combines PySceneDetect for shot segmentation and video analysis. The experimental results show that the model reaches 85.7% accuracy and an mAP@0.5 of 86.8% on the test set, significantly improving analysis efficiency. The model was then applied in a simulated film education classroom in which one instructor and ten students used it for shot scale recognition. In this setting, the model’s predictions agreed with the professor’s annotations on 90% of shots, and time efficiency improved substantially. Questionnaire results show that students readily accepted the model as a teaching aid, and interview feedback indicates that both the professor and the students favor such an efficient automated tool for reducing the burden of manual annotation.
Keywords: Film & TV Education, Shot Scale Recognition, Deep Learning, Automated Analytics
I. INTRODUCTION
In film and television education, the ability to recognize and analyze shot scale is essential. Bordwell, Thompson, and Smith (2004) identify camera analysis as the key to film narrative, describing it as a decoder [1]: by parsing visual parameters such as shot scale, angle, and movement, learners can decode the director’s rhetorical strategies and ideology. Shot language, as the core artistic form of film, is also directly involved in a film’s emotional transmission and audience perception. fMRI experiments by Hasson et al. (2008) demonstrated that cinematic camera language synchronizes viewers’ brain activity and triggers similar emotional responses [2]. In film and television education, students come to understand the director’s intention and enhance their creative abilities by analyzing shot scales [3].
However, shot scale identification currently suffers from low efficiency and subjective cognitive bias, which limit the scalability of such courses. First, different experts apply different standards to the same shot scale, and this subjectivity undermines the consistency of results and makes it difficult to standardize instruction [4]. At the same time, manual annotation is time-consuming, especially when analyzing complete films or multiple works, limiting classroom depth and breadth [5].
With the development of educational big data [6] and deep learning technology [7], image recognition can now accurately identify objects, people, and backgrounds in images [8]. To this end, this study leverages the existing MovieShots shot-category database and the mature YOLOv5 training framework to design and develop a deep learning-based automatic shot scale recognition model, aiming to improve analysis efficiency in film and television education, reduce subjectivity, and automate the workflow in support of teaching and research.
II. LITERATURE REVIEW
In film and television education, accurately identifying and analyzing shot scale matters: shot scale recognition is the basis for developing students’ aesthetic and narrative skills, the foundation of montage-language literacy [9], and the starting point for learning montage language [10]. At present, film and television professors commonly face a heavy shot-annotation workload. Studies have shown that manually analyzing a 90-minute film takes 6-8 hours on average, which seriously crowds out time for curriculum design and individualized instruction [11]. According to incomplete statistics, more than 87% of top film and television colleges and universities include shot scale identification and analysis as a core component of required courses, and shot scale categorization training accounts for 42% of the hours of visual narrative instruction, making it a foundational skill for students mastering the language of directing [12].
Limitations of manual shot scale recognition. Currently, in film and television teaching, shot scale recognition is done manually by professors or students using basic tools such as frame-by-frame viewing in a player. First, this traditional workflow runs into efficiency bottlenecks. Jacobs (2023) showed that manually analyzing a 5-minute film clip takes more than 30 minutes on average, including pausing, recording, and re-checking the video [11]. Second, differences in analysts’ experience and perspectives lead to poor consistency of results, which undermines the standardization of teaching. Pearlman (2019) found that, because of subjectivity bias, different experts agree on the same shot scale category only 68% of the time [13], and the discrepancy is even greater when the shots involve motion. Finally, teaching scales poorly. Manual annotation is time-consuming, especially when analyzing complete films or multiple productions, limiting the depth and breadth of the classroom. Reviewing the current state of film and television teaching, Stenliden (2014) noted that traditional lectures are limited by manual efficiency, time constraints, and class size, and that in a visually oriented course such as shot language analysis, the range of shot categories students can study in depth is extremely limited [14]. These shortcomings hinder students’ systematic learning of shot language and limit instructors’ ability to conduct in-depth analysis in a limited amount of time.
Deep learning methods supported by large-scale shot scale data. Publicly available large-scale shot-category databases, including the FilmNet Dataset, the BBC TV Archive Dataset, and the MovieShots Dataset, now support fine-grained shot analysis, provide standardized benchmarks, and serve tasks such as shot recognition and scene segmentation [15]. The MovieShots Dataset is one of the largest movie shot annotation datasets currently available and includes v1 and v3 versions: the former contains 46,000 movie trailer shots, and the latter contains shots from 10 full-length films. Both versions share the same five shot scales (LS, FS, MS, CU, and ECU), providing a standardized benchmark for shot scale recognition studies [16]. Among deep learning frameworks, the YOLO family demonstrates the efficiency of deep learning in object detection and provides a technical basis for shot scale recognition; its lightweight variant (YOLOv5s) further reduces computational cost [17]. This study therefore combines the MovieShots Dataset, YOLOv5, and video processing techniques to bridge the automation gap and improve analysis consistency and efficiency.
III. METHODOLOGY
This study designs and develops a deep learning-based automatic shot scale recognition model to address the low automation, subjectivity, and inefficiency of shot analysis in film and television education. The research methodology comprises three core parts: data collection and organization, model design and development, and model evaluation. The MovieShots Dataset is selected, the YOLOv5 framework is used to train the shot scale recognition model, and PySceneDetect is combined with it to realize an end-to-end process from video input to analysis output.
A. Data collection and processing
This study uses the MovieShots Dataset as the underlying data source. It contains two relevant subsets: version v1, with 46,000 annotated movie trailer shots, and version v3, covering annotated shots from 10 full-length films. Because the study focuses on shot scale identification, the v2 data, which only contain shot movement information, were excluded.
Because the raw data lack bounding box annotations, this study established a standardized spatial proportion model for each shot scale, namely Long Shot (LS), Full Shot (FS), Medium Shot (MS), Close-Up (CU), and Extreme Close-Up (ECU), generating a representative spatial distribution for each type of shot. Specifically, the Long Shot (LS) is set to occupy 30% of the screen space, the Full Shot (FS) and Medium Shot (MS) occupy 50% and 70%, respectively, the Close-Up (CU) expands to 90% of the screen, and the Extreme Close-Up (ECU) covers the entire screen area. A code sketch of this labeling rule follows.
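To make this rule concrete, the following is a minimal sketch of how such proportion-based boxes could be written as YOLO-format label files (one `.txt` per image, `class x_center y_center width height`, normalized to [0, 1]). It assumes the proportion is applied to the box’s width and height and that the box is centered in the frame; the paths and helper names are illustrative, not the study’s actual code.

```python
from pathlib import Path

# Class index and assumed frame proportion for each shot scale (per the rule above).
SCALE_PROPORTION = {
    "long": (0, 0.30),     # LS occupies ~30% of the frame
    "full": (1, 0.50),     # FS ~50%
    "medium": (2, 0.70),   # MS ~70%
    "close": (3, 0.90),    # CU ~90%
    "extreme": (4, 1.00),  # ECU covers the whole frame
}

def write_yolo_label(image_path: Path, shot_scale: str, label_dir: Path) -> None:
    """Write a YOLO-format label: one centered box sized by the scale's proportion."""
    class_id, prop = SCALE_PROPORTION[shot_scale]
    line = f"{class_id} 0.5 0.5 {prop:.2f} {prop:.2f}\n"
    label_dir.mkdir(parents=True, exist_ok=True)
    (label_dir / f"{image_path.stem}.txt").write_text(line)

# Example call (illustrative paths):
# write_yolo_label(Path("frames/shot_0001.jpg"), "medium", Path("labels/train"))
```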
After standardization, the data are divided into training, validation, and test sets in a 7:2:1 ratio to ensure sound model training and evaluation.
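A simple way to realize such a 7:2:1 split is sketched below; the directory layout and random seed are assumptions for illustration, and the corresponding label files would be copied alongside the images in the same way.

```python
import random
import shutil
from pathlib import Path

def split_dataset(image_dir: Path, out_dir: Path, seed: int = 42) -> None:
    """Randomly split images into train/val/test folders in a 7:2:1 ratio."""
    images = sorted(image_dir.glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n = len(images)
    bounds = {"train": (0, int(0.7 * n)),
              "val": (int(0.7 * n), int(0.9 * n)),
              "test": (int(0.9 * n), n)}
    for split, (lo, hi) in bounds.items():
        target = out_dir / split / "images"
        target.mkdir(parents=True, exist_ok=True)
        for img in images[lo:hi]:
            shutil.copy(img, target / img.name)

# split_dataset(Path("movieshots/frames"), Path("movieshots/split"))
```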
B. Model design and development
Model design and development proceeds in three steps: environment configuration, model training, and video analysis. The environment is configured by cloning the YOLOv5 repository and installing its dependencies; the lightweight and efficient YOLOv5s is chosen as the core model and is combined with OpenCV, FFmpeg, and PySceneDetect to process the video data.
The first step is training. Pre-trained weights are used, input images are 640×640, the batch size is 16, and the model is trained for 50 epochs. The configuration file specifies the dataset paths and the category names, which are ‘long’, ‘full’, ‘medium’, ‘close’, and ‘extreme’. Because training relied on the CPU alone (no GPU), training on about 56,000 images (v1 and v3 combined) took roughly 150 hours.
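As an illustration of this configuration (the file names and paths here are assumptions, not taken from the study), the dataset description and the standard YOLOv5 training invocation would look roughly as follows:

```python
import subprocess
from pathlib import Path

# Dataset configuration expected by YOLOv5 (paths and file names are illustrative).
Path("data").mkdir(exist_ok=True)
Path("data/movieshots.yaml").write_text(
    "train: ../movieshots/split/train/images\n"
    "val: ../movieshots/split/val/images\n"
    "nc: 5\n"
    "names: ['long', 'full', 'medium', 'close', 'extreme']\n"
)

# Train YOLOv5s from the official pre-trained weights: 640x640 inputs,
# batch size 16, 50 epochs, CPU only (run from inside the cloned yolov5 repository).
subprocess.run(
    ["python", "train.py",
     "--img", "640", "--batch", "16", "--epochs", "50",
     "--data", "data/movieshots.yaml",
     "--weights", "yolov5s.pt", "--device", "cpu"],
    check=True,
)
```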
The second step builds on PySceneDetect, an open-source tool and Python library for video scene segmentation and transition detection. The study uses PySceneDetect to automatically analyze video content, detect scene transitions, and segment the video into separate clips.
The third step is the model prediction stage. The video analysis function takes an MP4 file as input. PySceneDetect, with a content detection threshold of 30, segments the footage and extracts keyframes into a storage directory; the trained model then classifies the keyframes, and the results are integrated into a CSV report containing the shot ID, type, duration, and time range. This pipeline achieves automated analysis from video input to structured output; a sample of the output is shown in the table below, followed by a code sketch of the pipeline.
Shot ID | Type | Duration | Time range |
---|---|---|---|
001 | long | 12.5 s | 00:00 – 00:12.5 |
002 | close | 3.8 s | 00:12.5 – 00:16.3 |
003 | medium | 7.2 s | 00:16.3 – 00:23.5 |
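The sketch below outlines this pipeline under a few assumptions (PySceneDetect ≥ 0.6 API, one middle keyframe per shot, and the trained weights loaded through torch.hub); the file names are illustrative, and this is not the study’s exact code.

```python
import csv
import cv2
import torch
from scenedetect import detect, ContentDetector

VIDEO = "clip.mp4"                                            # illustrative input file
scenes = detect(VIDEO, ContentDetector(threshold=30.0))       # shot boundaries at threshold 30

# Load the trained weights as a custom YOLOv5 model via torch.hub.
model = torch.hub.load("ultralytics/yolov5", "custom", path="runs/train/exp/weights/best.pt")

cap = cv2.VideoCapture(VIDEO)
rows = []
for shot_id, (start, end) in enumerate(scenes, start=1):
    # Use the middle frame of each detected shot as its keyframe.
    cap.set(cv2.CAP_PROP_POS_FRAMES, (start.get_frames() + end.get_frames()) // 2)
    ok, frame = cap.read()
    if not ok:
        continue
    det = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).pandas().xyxy[0]
    shot_type = (det.sort_values("confidence", ascending=False)["name"].iloc[0]
                 if len(det) else "unknown")
    rows.append([shot_id, shot_type,
                 round(end.get_seconds() - start.get_seconds(), 1),
                 f"{start.get_timecode()} - {end.get_timecode()}"])
cap.release()

# Write the structured report described above.
with open("shot_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Shot ID", "Type", "Duration (s)", "Time range"])
    writer.writerows(rows)
```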
C. Model evaluation and analysis
To comprehensively assess the model performance, this study adopts a multi-level testing scheme. First, in terms of quantitative evaluation, the study selects common metrics such as accuracy, precision, and recall, and conducts rigorous testing on the model using an independent test set. Second, through confusion matrix analysis, this study examines the model’s ability to recognize different types of shots.
Evaluation Metrics. Accuracy, Macro-averaged F1 Score, and mAP@0.5 (Mean Average Precision at IoU=0.5) are three commonly used performance metrics for different task scenarios. Accuracy measures the proportion of correct predictions overall and is calculated as the ratio of correctly predicted samples to the total number of samples; it is intuitive and easy to understand but can be misleading when the class distribution is imbalanced. The macro-averaged F1 score evaluates model performance by averaging the F1 scores of all categories, where the F1 score is the harmonic mean of precision and recall. This metric treats all categories equally, making it particularly suitable for multi-class classification tasks with class imbalance. mAP@0.5, mainly used for object detection and information retrieval tasks, evaluates the model’s combined localization and classification ability by averaging the per-class average precision at an intersection over union (IoU) threshold of 0.5. The core idea is that a predicted box is only counted as correct when its overlap with the ground-truth box reaches 50% or more.
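For reference, these metrics can be written compactly using their standard definitions, with N test samples, C classes, and per-class precision P_c and recall R_c:

$$
\text{Accuracy}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i=y_i\right],\qquad
F1_c=\frac{2P_cR_c}{P_c+R_c},\qquad
\text{Macro-}F1=\frac{1}{C}\sum_{c=1}^{C}F1_c,\qquad
\text{mAP@0.5}=\frac{1}{C}\sum_{c=1}^{C}AP_c^{\,\mathrm{IoU}=0.5}
$$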
IV. MODEL PREDICTION PERFORMANCE AND ANALYSIS
A. Dataset for model evaluation
This study constructs a test set based on the official MovieShots Dataset, covering labeled data from the v1 (movie trailers) and v3 (full movies) subsets. The test set contains 11,700 shots in total, of which v1 accounts for 9,200 shots (78.6%) and v3 for 2,500 shots (21.4%). As shown in Table 2, the distribution of shot scales is markedly imbalanced: Medium Shot (MS) has the highest proportion (31.5%) and Extreme Close-Up (ECU) the lowest (5.3%). This distribution reflects the natural proportion of shot scales in real films and provides a realistic basis for assessing the model’s generalization ability.
Shot scale | Sample size | Percentage |
---|---|---|
LS | 1,976 | 16.9% |
FS | 2,900 | 24.8% |
MS | 3,680 | 31.5% |
CU | 2,120 | 18.1% |
ECU | 624 | 5.3% |
B. Model performance evaluation
The overall performance of the model on the test set is strong, with an accuracy of 85.7% and a mean average precision (mAP@0.5) of 86.8%, as shown in Table 3. The macro-averaged precision is 84.3% and the recall 83.9%; the difference between the two is less than 0.5%, indicating that the model is not markedly biased by the class imbalance. mAP@0.5 is slightly higher than the accuracy (by 1.1 percentage points), which may be attributed to the gain that bounding box estimation contributes to the classification task. To validate stability, five replicate experiments were conducted and showed good robustness. These results indicate that the model is reliable and consistent in the shot scale recognition task.
Metric | Value | Interpretation |
---|---|---|
Accuracy | 85.7% | Overall classification reliability |
Macro-average F1 | 84.1% | Performance balanced across imbalanced categories |
mAP@0.5 | 86.8% | Combined localization and classification accuracy |
In this study, statistical tests were used to verify performance differences between shot scales. First, a chi-square test was used to analyze the differences in classification results across the five shot scale categories. The test statistic was χ² = 42.7 (df = 4) with p < 0.01, indicating that the performance differences between categories are statistically significant. In addition, the study analyzed the commonly confused Medium Shot (MS) and Full Shot (FS) samples (n = 1,248 confusion pairs) with McNemar’s test, obtaining p = 0.003, which confirms that the confusion between these two categories is not random but stems from the similarity of their compositional features, such as subject proportions close to the transition range of 30-50%.
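To illustrate the testing procedure (not to reproduce the exact reported statistics), the sketch below runs a chi-square test on the per-class correct/incorrect counts implied by Table 5 and McNemar’s test on the FS/MS discordant counts; the library choices (SciPy, statsmodels) and the way the tables are assembled are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar

# Per-class correct vs. incorrect counts taken from the confusion matrix (Table 5):
# order is LS, FS, MS, CU, ECU.
correct = np.array([1644, 2453, 2995, 1817, 430])
totals = np.array([1976, 2900, 3680, 2120, 624])
table = np.vstack([correct, totals - correct])

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p:.4f}")

# McNemar's test uses only the discordant cells of a 2x2 table;
# here: FS predicted as MS = 245, MS predicted as FS = 382 (from Table 5).
result = mcnemar([[0, 245], [382, 0]], exact=False, correction=True)
print(f"McNemar statistic = {result.statistic:.1f}, p = {result.pvalue:.4f}")
```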
The statistics also show significant differences in the model’s predictions across shot scales, as shown in Table 4. Long Shot (LS) and Close-Up (CU) achieved the highest F1 scores, 85.3% and 86.0%, respectively. The analysis found that LS and CU are recognized with high accuracy thanks to the strong discriminability of their visual features: LS usually contains broad scenes, while CU is marked by a high proportion of the subject in the frame. Medium Shot (MS) has the lowest recall, 81.4%, mainly because of its confusion with Full Shot (FS) in boundary cases. Finally, since Extreme Close-Up (ECU) has the smallest sample size, its F1 score is 82.9%, indicating that data sparsity has some impact on performance stability.
Shot scale | Precision | Recall | F1 | Sample size | 95% confidence interval |
---|---|---|---|---|---|
LS | 87.5% | 83.2% | 85.3% | 1,976 | [85.1%, 89.9%] |
FS | 82.1% | 84.6% | 83.3% | 2,900 | [80.3%, 83.9%] |
MS | 83.8% | 81.4% | 82.6% | 3,680 | [82.1%, 85.5%] |
CU | 86.3% | 85.7% | 86.0% | 2,120 | [84.5%, 88.1%] |
ECU | 83.7% | 82.1% | 82.9% | 624 | [80.2%, 87.2%] |
C. Confusion Matrix and Error Patterns
To analyze the relationship between the model’s predictions and the ground truth, this study uses a confusion matrix to provide a comprehensive and intuitive view of the recognition results. The diagonal values of the confusion matrix show how many samples of each category were predicted correctly; the off-diagonal values indicate which categories are easily confused, i.e., the patterns of misrecognition. By comparing the diagonal values (correct recognitions) with the off-diagonal values (incorrect recognitions), the performance differences between categories, i.e., which ones are predicted accurately and which are not, can be identified.
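As a brief illustration of how such a matrix can be computed and row-normalized (using scikit-learn; the label lists here are dummy placeholders, not the study’s data):

```python
from sklearn.metrics import confusion_matrix

labels = ["LS", "FS", "MS", "CU", "ECU"]
# Placeholder ground-truth and predicted shot scales for a handful of keyframes.
y_true = ["LS", "FS", "MS", "MS", "CU", "ECU", "FS"]
y_pred = ["LS", "MS", "MS", "FS", "CU", "CU", "FS"]

cm = confusion_matrix(y_true, y_pred, labels=labels)                        # raw counts (cf. Table 5)
cm_row = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")  # row percentages (cf. Table 6)
print(cm)
print((cm_row * 100).round(1))
```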
The standardized confusion matrix is shown in Table 5, which reveals the main error patterns in the model’s predictions. First, the mutual confusion between FS and MS is the most significant, at 33.5% (627/1,872). This error pattern is frequent in shots where the subject occupies 55-65% of the frame: of the 627 samples in which FS and MS were misidentified as each other, most are half-body dialogue scenes in which the characters tend to occupy 55-65% of the frame. Next, 16.3% (102/624) of ECU shots were misidentified as CU; examination of these 102 error samples showed that they were mostly extreme close-ups of faces and above-the-shoulder close-ups. In addition, LS was misjudged as FS in 9.4% (186/1,976) of cases, most commonly in distant shots containing prominent foreground objects (e.g., trees or buildings), where background interference led to model errors.
Actual \ Predicted | LS | FS | MS | CU | ECU | Total |
---|---|---|---|---|---|---|
LS | 1,644 | 186 | 92 | 34 | 20 | 1,976 |
FS | 142 | 2,453 | 245 | 40 | 20 | 2,900 |
MS | 98 | 382 | 2,995 | 145 | 60 | 3,680 |
CU | 28 | 52 | 138 | 1,817 | 85 | 2,120 |
ECU | 10 | 24 | 58 | 102 | 430 | 624 |
To show the error percentage for each shot scale more clearly, this study converted the original confusion matrix into row-normalized percentage form (Table 6). On this basis, three main error patterns were found. First, FS and MS are mutually misrecognized: 8.4% of FS samples are misidentified as MS and 10.4% of MS samples as FS, the highest error rates in the matrix, indicating that the model struggles to separate this pair of categories. Second, ECU is misidentified as CU: the ECU row shows that its errors concentrate on CU, and the 16.3% error rate indicates that these two close-up scales are easily mixed up. Third, LS is misidentified as FS: the 9.4% error rate in the LS row shows that LS errors mainly fall on FS.
Actual \ Predicted | LS | FS | MS | CU | ECU |
---|---|---|---|---|---|
LS | 83.2% | 9.4% | 4.7% | 1.7% | 1.0% |
FS | 4.9% | 84.6% | 8.4% | 1.4% | 0.7% |
MS | 2.7% | 10.4% | 81.4% | 3.9% | 1.6% |
CU | 1.3% | 2.5% | 6.5% | 85.7% | 4.0% |
ECU | 1.6% | 3.8% | 9.3% | 16.3% | 68.9% |
V. CASE ANALYSIS
The goal of the case study is to validate the model’s usefulness in teaching and to assess its consistency with the professor’s annotations, the time efficiency gains, and student acceptance. A 30-minute classroom simulation built around the classic “Odessa Steps” sequence from Battleship Potemkin was conducted to apply the model in teaching practice. After the teaching activity, a technology acceptance questionnaire (covering perceived usefulness and perceived ease of use) and unstructured interviews were used to assess students’ acceptance of and feedback on the model.
A. Teaching context and course setting.
The study was conducted in a 30-minute “Audiovisual Linguistic Analysis” class at a university in Northeast China. The course material was the “Odessa Steps” segment of Battleship Potemkin, which is about 5 minutes long and contains 30 shots. The task of the class was to explain to students the meaning and function of the different shots in the “Odessa Steps” clip. During the class, the instructor explained the meaning of the different shot scales and then guided 10 third-year film and television writing and directing students in using the model to identify the shot categories. Finally, the professor provided the manual labeling results for comparison and introduced the functions of the different shots to the students.
B. Steps in teaching activities.
The teaching activity consists of five steps. In the first step, the professor introduces the purpose of the activity, the procedure, and the recognition method, and demonstrates the model. In the second step, the students input the video into the model and obtain the automatic recognition results as a CSV file containing the shot ID, type, duration, and time range. In the third step, the professor provides the manual labeling results and compares them with the model’s output, marking the model’s misjudgments. In the fourth step, students analyze the camera language of the clip based on the results; for example, close-ups heighten tension and long shots create a chaotic atmosphere. In the fifth step, students’ acceptance of and feedback on the model are collected through questionnaires and interviews after class.
C. Data analysis.
This study evaluates the consistency of model predictions with the professor’s annotations, the time efficiency gains, and student acceptance. First, the model’s predictions for the Battleship Potemkin “Odessa Steps” clip (YOLOv5, test-set mAP@0.5 = 86.8%) were compared against the professor’s manual annotations. Second, the technology acceptance questionnaires and interview feedback from the 10 students were analyzed.
First, the consistency analysis between model predictions and professor annotations shows that the model performs well in teaching practice. Table 7 lists the complete labeling results of the model and the professor for the 30 shots. The model’s predicted distribution is: 8 Long Shots (LS, 27%), 10 Full Shots (FS, 33%), 7 Medium Shots (MS, 23%), 4 Close-Ups (CU, 13%), and 1 Extreme Close-Up (ECU, 3%). The professor’s annotation distribution is: LS 9 (30%), FS 11 (37%), MS 6 (20%), CU 3 (10%), ECU 1 (3%). The two agree on 27 shots, a consistency rate of 90% (27/30).
The differences appear mainly in three shots. In shot 005, the model predicts MS while the professor labels FS: the shot shows the soldiers advancing, and the model may have misclassified it as a medium shot because of motion blur, while the professor labeled it FS based on the dynamics of the full view. In shot 018, the model predicts MS while the professor labels CU. In shot 020, the model predicts CU while the professor labels MS: the shot shows the stroller sliding down, the model did not catch the close-up boundary, and the professor emphasized its emotional reinforcement. These differences suggest that the model is less precise at judging boundaries in rapidly edited montage sequences, whereas the professor’s labeling is more contextual and symbolic. Nonetheless, the 90% consistency rate shows that the model effectively aids shot scale recognition in teaching settings.
Shot ID | Model prediction | Duration (s) | Professor annotation | Duration (s) |
---|---|---|---|---|
1 | LS | 5 | LS | 5 |
2 | CU | 3 | CU | 3 |
3 | FS | 3 | FS | 3 |
4 | CU | 2 | CU | 2 |
5 | MS | 3 | FS | 3 |
6 | FS | 3 | FS | 3 |
7 | CU | 2 | CU | 2 |
8 | FS | 3 | FS | 3 |
9 | MS | 2 | MS | 2 |
10 | CU | 3 | CU | 3 |
11 | MS | 8 | MS | 8 |
12 | FS | 2 | FS | 2 |
13 | MS | 2 | MS | 2 |
14 | FS | 2 | FS | 2 |
15 | MS | 2 | MS | 2 |
16 | LS | 10 | LS | 10 |
17 | CU | 2 | CU | 2 |
18 | MS | 2 | MS | 2 |
19 | FS | 2 | FS | 2 |
20 | CU | 2 | CU | 2 |
21 | MS | 2 | MS | 2 |
22 | FS | 2 | FS | 2 |
23 | ECU | 4 | ECU | 4 |
24 | LS | 3 | LS | 3 |
25 | FS | 3 | FS | 3 |
26 | LS | 6 | LS | 6 |
27 | MS | 3 | MS | 3 |
28 | FS | 3 | FS | 3 |
29 | LS | 3 | LS | 3 |
30 | FS | 6 | FS | 6 |
Furthermore, the comparison of time efficiency highlights the model’s clear advantage. The model processes a 3-minute video and outputs results for 30 shots in about 2 minutes, whereas a professor manually annotating the same segment needs about 20 minutes, an efficiency improvement of roughly 10 times. This shows that the model can significantly reduce professors’ time investment in class preparation. For example, shot 016 is a 10-second long shot of the stroller sliding down: the model categorizes it in seconds, while the professor must observe and record it frame by frame. In addition, the CSV file output by the model (containing shot ID, type, duration, and time range) provides students with structured data that can easily be discussed and analyzed in class; students can use these data directly to explore, for example, how long shots create a chaotic atmosphere or how medium shots highlight characters’ emotions.
Next, student acceptance of the model was assessed through the questionnaire and interview data. Table 8 displays the responses of the 10 students to the technology acceptance questionnaire (5-point scale), covering perceived usefulness (Questions 1-7) and perceived ease of use (Questions 8-14). Scores of 4 (agree) or above account for 85% of responses, showing that students generally recognize the practicality and ease of operation of the model. For example, Question 9 (“Doesn’t require much time or effort”) has an average score of 4.5, reflecting high acceptance of the time efficiency, and Question 4 (“Provides useful information”) has an average score of 4.6, indicating that students find the model’s output helpful for learning.
Finally, the interview feedback. Student S1 said that “it saves a lot of time, and I can focus on analyzing the director’s approach,” and S7 felt that “the results are similar to the professor’s, so I feel relieved to use it,” reflecting the model’s advantages in efficiency and reliability. However, S2 pointed out that “close-ups and medium shots are hard to tell apart,” and S4 suggested that “explaining directly why this shot is used would be better,” indicating room for improvement in boundary recognition and functional explanation. Overall, the students were receptive to the model, feeling that it improved their learning efficiency and sense of participation, while also looking forward to smarter analysis functions and interface optimization.
Question No. | Content of the question | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Enriching learning | 4 | 5 | 4 | 4 | 5 | 4 | 3 | 4 | 5 | 4 |
2 | Helps to learn new things | 5 | 4 | 5 | 4 | 4 | 5 | 4 | 5 | 4 | 5 |
3 | Smooth guidance mechanisms | 5 | 5 | 4 | 4 | 5 | 4 | 5 | 4 | 5 | 4 |
4 | Providing useful information | 3 | 4 | 3 | 4 | 4 | 3 | 4 | 3 | 4 | 3 |
5 | Concept maps provide information | 4 | 5 | 4 | 5 | 4 | 5 | 4 | 4 | 5 | 4 |
6 | Learn better | 4 | 5 | 4 | 5 | 4 | 5 | 4 | 4 | 5 | 4 |
7 | More effective than general computer-assisted learning | 5 | 4 | 5 | 4 | 5 | 4 | 3 | 5 | 4 | 5 |
8 | Not difficult to operate | 4 | 5 | 4 | 3 | 4 | 5 | 4 | 4 | 5 | 4 |
9 | Doesn’t require much time or effort | 5 | 4 | 5 | 4 | 5 | 4 | 5 | 4 | 4 | 5 |
10 | The content is clear and easy to understand | 4 | 5 | 4 | 5 | 4 | 4 | 3 | 5 | 4 | 4 |
11 | Learn how it works in no time | 5 | 4 | 5 | 4 | 5 | 5 | 4 | 4 | 5 | 5 |
12 | Difficulty-free operation process | 4 | 5 | 4 | 4 | 4 | 5 | 4 | 3 | 5 | 4 |
13 | Interface is easy to use | 4 | 4 | 3 | 5 | 4 | 4 | 4 | 4 | 5 | 4 |
14 | Overall ease of use | 5 | 5 | 4 | 4 | 5 | 4 | 5 | 4 | 5 | 5 |
VI. DISCUSSION
The YOLOv5-based automatic shot scale recognition model developed in this study demonstrates clear advantages in film and television education. Compared with traditional manual analysis [5], the model reduces analysis time from tens of minutes to a few minutes, an efficiency gain of roughly an order of magnitude, while the classification agreement rate reaches 90%, well above the 68% consistency reported among experts [13]. This result validates the potential of deep learning in film and television analysis, especially for standardizing output and reducing human bias. Compared with traditional rule-based tools, YOLOv5’s end-to-end processing allows professors to focus on instructional design rather than tedious labeling.
However, the model still has some limitations. First, the bounding box estimation relies on preset rules, which may lead to detail errors and can be optimized by semi-automatic annotation in the future. Second, the current model only supports shot scale category recognition, which does not cover dimensions such as motion type and needs to be extended to multimodal analysis [15]. In addition, the generalizability of the model to specific styles such as experimental film or animation still needs to be verified. These limitations provide directions for future research, such as developing joint detection frameworks or enhancing data diversity.
VII. CONCLUSION
In this study, we designed and validated a YOLOv5-based automatic shot scale recognition model, which effectively addresses the low efficiency and subjectivity of analysis in film and television education. Experiments show that the model achieves 85.7% accuracy on the MovieShots Dataset and saves time in teaching applications. Questionnaire analysis also indicates that applying the model enhances student engagement. Its core value is that it provides professors with an efficient tool that frees preparation time for instructional design, gives students immediate feedback that deepens their understanding of the director’s language, and can be used to build a large-scale case base for the course, expanding the possibilities of teaching and learning.
This study still has limitations that warrant attention. First, the model is limited in dynamic shot recognition: its performance on fast motion shots and complex scene transitions leaves room for improvement, and its analysis of shot-movement coherence is insufficient, so future work should combine temporal sequence modeling to strengthen dynamic recognition. Second, the teaching verification was small in scale: the classroom application involved only 10 students, which may limit the generalizability of the conclusions. Future research can also integrate shot movement recognition to improve the model’s ability to recognize both static and dynamic shots.
References
[1] Pellegrini, M., & Vivanet, G. (2021). Evidence-based policies in education: Initiatives and challenges in Europe. ECNU Review of Education, 4(1), 25-45.
[2] Bordwell, D., Thompson, K., & Smith, J. (2004). Film art: An introduction (Vol. 7, pp. 477-479). New York: McGraw-Hill.
[3] Hasson, U., Landesman, O., Knappmeyer, B., Vallines, I., Rubin, N., & Heeger, D. J. (2008). Neurocinematics: The neuroscience of film. Projections, 2(1), 1-26.
[4] Branigan, E. (2013). Projecting a camera: language-games in film theory. Routledge.
[5] Pearlman, K. (2012). Cutting rhythms: Shaping the film edit. Routledge.
[6] Cutting, J. E. (2021). Movies on our minds: the evolution of cinematic engagement. Oxford University Press.
[7] Bai, Y., Zhao, F., Kumamoto, E., & Yin, C. (2024). Precisely identifying and mitigating shallow reading behaviors through online reading data: exploring quantitative reading behavior methods and precise behavioral intervention. Interactive Learning Environments, 1-15.
[8] Zhao, F., Liu, G. Z., Zhou, J., & Yin, C. (2023). A learning analytics framework based on human-centered artificial intelligence for identifying the optimal learning strategy to intervene in learning behavior. Educational Technology & Society, 26(1), 132-146.
[9] Zhao, F., & Yin, C. (2021, July). Data collection in the learning behavior analysis: A systematic review. In 2021 10th International Congress on Advanced Applied Informatics (IIAI-AAI) (pp. 178-181). IEEE.
[10] Monaco, J. (2009). How to read a film: Movies, media, and beyond. Oxford University Press.
[11] Eisenstein, S. (1969). Film form: Essays in film theory (J. Leyda, Trans.). Harcourt Brace Jovanovich.
[12] Jacobs, M. (2023). Movies on Our Minds: The Evolution of Cinematic Engagement by James E. Cutting. The Velvet Light Trap, 92(1), 68-70.
[13] Perry, M. S. (2018). 21st Century Skills through Film Production in Tertiary Education: A transformative Assessment in a Literature and Media Course. 3L: Southeast Asian Journal of English Language Studies, 24(4).
[14] Pearlman, K. (2019). On rhythm in film editing. The Palgrave handbook of the philosophy of film and motion pictures, 143-163.
[15] Stenliden, L. (2014). Visual Storytelling Interacting in School: Learning Conditions in the Social Science Classroom. Linköping University.
[16] Chen, H., Wang, L., & Zhang, W. (2023). FilmNet: A multimodal dataset for film shot analysis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11245-11254.
[17] Li, Y., Min, X., Xu, K., & Jiang, S. (2022). MovieShots: a large-scale dataset for movie scene understanding. IEEE Transactions on Multimedia, 24, 1234-1245.
[18] Jocher, G., Stoken, A., Chaurasia, A., Borovec, J., Kwon, Y., Michael, K., … & Thanh Minh, M. (2021). Ultralytics/yolov5: v6.0 - YOLOv5n ‘Nano’ models, Roboflow integration, TensorFlow export, OpenCV DNN support. Zenodo.
All authors contributed equally to this work.
Authors
Yu Bai
was born in 1988 in China. She is an Experimentalist at the School of Media Science, Northeast Normal University, China. Her research focuses on online learning, digital media, and learning behavior analysis.
Xinyang Li
was born in 2001 in China. She is currently pursuing graduate studies at the School of Media Science (School of Journalism) of Northeast Normal University. Her research interests include film and television aesthetics and Northeast Asian film and television culture studies.