Video Structure Analysis

  • Journal: Tsinghua Science and Technology (清华大学学报(英文版))
  • File size: 548 KB
  • Authors: ZHANG Songhai, ZHANG Yifei, CHEN Tao, Peter M. Hall, Ralph Martin
  • Affiliations: Department of Computer Science and Technology, Tsinghua University; School of Computer Science, Cardiff University
  • Updated: 2020-11-22
Paper Overview

TSINGHUA SCIENCE AND TECHNOLOGY
ISSN 1007-0214 14/20 pp714-718
Volume 12, Number 6, December 2007

Video Structure Analysis*

ZHANG Songhai (张松海)¹**, ZHANG Yifei (张一飞)¹, CHEN Tao (陈韬)¹, Peter M. Hall², Ralph Martin³

1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
2. Department of Computer Science, University of Bath, Bath BA2 7AY, United Kingdom;
3. School of Computer Science, Cardiff University, Wales CF24 3AA, United Kingdom

Abstract: Video structure analysis is a basic requirement for most content-based video editing and processing systems. This paper presents a fast video structure analysis method based on image segmentation in each frame, with region matching between frames. The structure analysis decomposes the video into several moving objects, including information about their colors, positions, shapes, movements, and lifetimes. The method also supports user interaction to improve the results. The results show that this method is fast and stable and can complete video analysis interactively.

Key words: video structure analysis; video segmentation; image segmentation; region matching

Introduction

Video editing is widely used not only in professional movie making, but also in many other areas such as home videos, educational videos, and marketing videos. With the popularization of video cameras, even more potential applications are developing. Even relatively straightforward applications may wish to edit the video (for example, to remove an unwanted object from a scene) or to process it in some other way (for example, to re-render it as a cartoon). In general, such tasks need an understanding of the structure of the video content, as a basis both for editing and for more advanced and complex video processing operations.

The need to treat the whole video in a coherent manner has been emphasized in several previous publications. For example, a video-tooning approach[1] pays particular attention to providing coherent segmentation in space-time to re-render the video as a cartoon animation. A recent paper on motion layer based object removal in videos[2] also proposed a motion-based video segmentation method as a preparatory step to object removal.

In many applications, this video analysis and segmentation step is the most time-consuming step in the application. For example, video tooning may take several hours to segment a short video clip. This paper presents a fast video structure analysis method based on image segmentation in each frame, with region matching between frames.

Received: 2007-06-25
* Supported by the National Key Basic Research and Development (973) Program of China (No. 2006CB303106), the Specialized Research Fund for the Doctoral Program of Higher Education of MOE, P.R.C. (No. 20003057), and the Basic Research Foundation of Tsinghua National Laboratory for Information Science and Technology (TNList)
** To whom correspondence should be addressed. E-mail: zhangsh@gmail.com; Tel: 86-10-62797001-807

1 Related Work

The two main kinds of information in any video are the color appearance and the visual motion, which leads to three approaches to video analysis. (1) The first is based on motion tracking[3], as in the method used in motion layer based object removal[2]. Such approaches work best on videos with many highly textured regions, where trackers can readily follow moving objects and describe their motion with a simple model. (2) Another approach is based on color and appearance. Such approaches are more suited to videos with relatively little texture and simple scenes. They can, to a certain extent, cope with various changes in objects, including deformation, splitting, and color changes. Such approaches can be further subdivided into space-time segmentation, such as performed by the anisotropic kernel mean shift algorithm[4], and segmentation in each frame separately followed by matching regions between frames, as in the stroke surfaces approach[5]. Space-time segmentation has the advantage that it identifies the entire object by observations throughout the video, which can, for example, help solve issues of occlusion; but such approaches generally have long run times and can be sensitive to the speed of the object's motion and to variations in the object shape, color, and lighting, since this is a global approach. The stroke surfaces method is faster and can generally cope quite well with objects whose appearance changes, but its object matching ability needs to improve. The method described here follows a similar scheme of image segmentation in each frame, with region matching between frames. A greedy algorithm quickly matches regions between frames, and combines motion tracking and simple user interaction to improve the results. (3) The third approach is based on both color and motion information. Such approaches often introduce a statistics-based model[6-8] and use an expectation maximization (EM) algorithm to solve the optimization problem. Such approaches lose efficiency and the convenience of user interaction, and are not suitable for fast video structure analysis.

2 Method

The video structure analysis algorithm combines image segmentation and region matching. The algorithm first segments each frame into regions based on color properties. Image segmentation has been extensively studied[9-11], and any appropriate method can be used in this system. However, no segmentation algorithm is a universal method suitable for all kinds of videos, and each one has different efficiency characteristics and applicability. Thus, the segmentation algorithm should be appropriate to the nature of the input video. Of the well-known algorithms, tests show that mean shift segmentation[10] works well on videos with rigid bodies and motions as in Fig. 1, whereas histogram-based methods, i.e., color quantization, work well for videos containing non-rigid motion, for example moving liquids, fire, and clouds, as shown in Fig. 2.

Fig. 1 Mean shift segmentation
Fig. 2 Histogram-based segmentation

Even when using the most suitable algorithm for a given video, the segmentation result for each frame may typically have hundreds of regions, which is still a very complex representation. A higher level result needs to merge these regions in an appropriate way and to find associations between regions in adjacent frames of the video. Regions belonging to the same object are connected to derive a three-dimensional object in space-time. A given region may, for example, be occluded by some other object in any given frame, in which case the region is split in that frame, with the space-time representation branching at that frame. Conversely, parts of a region may merge when the occlusion disappears, so the object branches also merge in the space-time description.

Fig. 3 Video structure graph

The graph G(V, E_i, E_o) shown in Fig. 3 is defined to represent the video structure. The node set V is composed of the regions resulting from the segmentation of each frame. E_i is an inner edge set that connects nodes within a given frame, meaning that two regions within the frame belong to the same object. Certain inner edges may be placed by the user, since human intelligence may be able to detect meaningful connections which are very difficult for an algorithm to deduce. E_o is an outer edge set that connects nodes between neighboring frames, denoting corresponding regions between frames. A region may have no corresponding regions in the previous frame, meaning that the region has come into existence in that frame. Similarly, a region may have no corresponding regions in the subsequent frame, meaning that the region vanishes in that frame. More generally, as well as the obvious 1-1 correspondences between regions in adjacent frames, 1-N correspondences will also occur, with multiple outer edges connecting regions in the next frame, as objects split or become occluded. N-1 correspondences, with multiple outer edges to regions in the previous frame, represent objects merging or no longer being occluded. To simplify the problem, the algorithm assumes that there are no N-N correspondences in the graph.

To handle regions that appear or disappear, a virtual node representing a NULL region is added in each frame. Any regions in the next frame which come into existence are connected by a virtual edge to this NULL region, and any regions which disappear in the previous frame are again connected by a virtual edge to the NULL region. In this way, apart from the NULL regions, every region has at least one edge connecting it to some region in the next frame, and at least one edge connecting it to some region in the previous frame.

A weight is associated with each edge to represent the similarity between regions. Larger weights indicate greater similarity between two regions. Virtual edges corresponding to 0-1 and 1-0 relationships have a threshold weight w_T, which is a user-set parameter. E_i and E_o determine the video structure, and each connected sub-graph represents an object. If E_i and E_o are empty, each region becomes an isolated object which appears in only one frame. The optimal structure of E_i and E_o can be formulated as a maximization problem: max Σ_{e ∈ E_o} w_e, subject to the constraint that no N-N correspondences are formed between regions, where w_e is the weight of edge e in E_o. (Note that considering the connectivity of nodes in pairs of successive frames is not sufficient to ensure that no N-N correspondences between regions exist, as edges in previous or later frames may eventually join these nodes into a single region.)

3 Edge Weighting Function

The edge weight measures the similarity between regions. The edge weighting function is set to negative infinity if the regions are too different (to prevent these regions from merging), and is otherwise set to the weighted sum of three components:

w(e) = -∞, if spatialDis > Δ_s or colorDis > Δ_c;
w(e) = -w_1·|areaRatio| - w_2·spatialDis - w_3·colorDis, otherwise.

areaRatio is the logarithm of the ratio of the areas of the two regions. colorDis is the distance in the CIELab color space between the average pixel colors of the two regions. spatialDis measures the spatial distance between the two regions, each of which is approximated by an ellipse. An ellipse E is represented by its center c, its two axis lengths a and b, and the angle θ between the long axis and the x axis. The distance between two ellipses E_1 and E_2, where p denotes a point, is

spatialDis = D(E_1, E_2) = D_1(E_1, c_2) + D_1(E_2, c_1),

D_1(E, p) = [x_p - x_c, y_p - y_c] M [x_p - x_c, y_p - y_c]^T, where

M = | (a cosθ)² + (b sinθ)²    sinθ cosθ (b² - a²)   |
    | sinθ cosθ (b² - a²)      (b cosθ)² + (a sinθ)² |

The weights and thresholds are chosen according to the characteristics of the specific video. Often w_1 = 0.8, w_2 = 0.6, w_3 = 0.2. If objects move rapidly in the video, Δ_s should be larger, typically 30. If lighting changes or the given objects change color, Δ_c should be larger, typically 10.

4 Correspondence

The maximization problem cannot be solved exactly in reasonable time; therefore, a greedy algorithm is used to approximately optimize the region connectivity structure. The problem is solved frame-by-frame to determine correspondences between successive frames, starting from the first frame. In each frame, the objective is to find as many correspondences as possible while observing the constraint that there should be no N-N correspondences. In practice, the algorithm emphasizes forming 1-1 correspondences. Details of the correspondence computation for frame i are given below. The k-th region in frame i is denoted R_i^k.

Step 1: Construct the new region set for frame i. If i = 1, i.e., the first frame, the new region set is simply set to the segmentation result. For other frames, search for any regions which connect to the same region in the previous frame, and merge these into a new bigger region. Then recompute its area, average color, and modeling ellipse. Also combine any regions for which the user has defined an edge in E_i.

Step 2: Find 1-1 correspondences. A 1-1 correspondence is regarded as the most stable and plausible relationship. Such relationships generally have a large weight, representing regions of similar color and size that move slowly without occlusions. Compare each region in frame i with each region in frame i+1. If the edge between R_i^k and R_{i+1}^j has the maximum weight among all edges of R_i^k and, at the same time, the maximum weight among all edges of R_{i+1}^j, then construct a 1-1 correspondence and add it into E_o. Then mark these edges as used, and repeat the process for the remaining edges until no more 1-1 correspondences of this kind can be found. Note that such 1-1 correspondences may be changed to 1-N or N-1 correspondences in later steps. For example, a big region could have edges to a region of similar area and to several small regions in the next frame.

Step 3: Find 1-N and N-1 correspondences. Sort the remaining edges in descending order of weight. Then consider each edge in turn and add the edge to E_o if it satisfies two constraints. First, adding the edge must not create an N-N correspondence. Second, the area difference between the one big region and the sum of the areas of the N small regions should be no more than 10%, since objects remain a similar size from frame to frame, even if they fragment.

Step 4: Attach isolated regions in frame i and frame i+1. Join any isolated regions with neighboring regions within the frame if they are sufficiently similar in color and position. Any remaining isolated regions in frame i are assumed to have disappeared in the next frame, while any remaining isolated regions in frame i+1 are assumed to be newly born objects.

After Step 1, the user can add region correspondences into E_o by hand, or a motion tracking algorithm can be used to determine the desired correspondences. Such edges are added and marked before proceeding to Step 2.

After finding the correspondences between all successive pairs of frames, each connected sub-graph in the final result corresponds to a single object. The information about this object is then derived from the sub-graph, including its shape, color, and behavior over time.

5 Results

The method was tested using the bearwavi.avi video, courtesy of Peter Hall of the University of Bath, on a computer with a 2.4-GHz CPU and 512 MB of memory. The EDISON mean shift algorithm[10] was used for the image segmentation. The segmentation speed depends on the parameters used. If the color bandwidth and spatial bandwidth are set to 4, the segmentation takes only a few seconds per frame and produces more than 1000 regions per frame. The algorithm then takes about 8 s to match regions in 30 frames. Setting the color bandwidth and spatial bandwidth to 14 gives better segmentation results for automatic region matching, as shown in Fig. 4, but it then takes more than 60 s to segment each frame, producing about 80 regions per frame. The region matching was then done in less than one second. Several tests showed that the overall success rate for automatic region matching is about 68%.

The matching process was extended from the stroke surfaces method, which can only handle some ideal cases. The method gives much better run time performance than video tooning. As the algorithm processes each frame, the user can interactively calculate the video structure. The algorithm continues image segmentation on the next frame while the user interactively adjusts region matching on the current frame, so the two tasks do not conflict on a dual-core machine.

Fig. 4 Results of segmentation and region matching. Top row: results for frame 1; bottom row: results for frame 5. (a) Result image; (b) Segmentation result; (c) Automatic region matching result.

6 Conclusions and Future Work

This paper describes a fast video structure analysis method based on image segmentation and region matching. The structure analysis decomposes the video into several moving objects, giving information about the color, position, shape, movement, and lifetime of each object. The method also supports user interaction to improve the results; the interaction needed is straightforward and obvious. Any image segmentation algorithm can be used in the first step to suit the source video characteristics. A fast greedy algorithm is used to approximately optimize the structure graph of the image segmentation results.

One minor problem is that the algorithm may cause objects to become too large in the last few frames. Alternative optimization methods which simultaneously consider correspondences in all frames may improve the results, but these would be slower and the improvement might not be worth the extra time spent.

References

[1] Wang J, Xu Y, Shum H Y, et al. Video tooning. ACM Transactions on Graphics, 2004, 23(3): 574-583.
[2] Zhang Y, Xiao J, Shah M. Motion layer based object removal in videos. In: Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION'05). Washington DC, USA: IEEE Computer Society, 2005, 1: 516-521.
[3] Shi J, Tomasi C. Good features to track. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94). Seattle, USA, 1994: 593-600.
[4] Wang J, Thiesson B, Xu Y, et al. Image and video segmentation by anisotropic kernel mean shift. In: Proceedings of European Conference on Computer Vision. Prague, Czech Republic, 2004: 238-249.
[5] Collomosse J P, Rowntree D, Hall P M. Stroke surfaces: Temporally coherent artistic animations from video. IEEE Transactions on Visualization and Computer Graphics, 2005, 11(5): 540-549.
[6] Chang M M, Tekalp A M, Sezan M I. Simultaneous motion estimation and segmentation. IEEE Transactions on Image Processing, 1997, 6(9): 1326-1333.
[7] Zitnick C L, Jojic N, Kang S B. Consistent segmentation for optical flow estimation. In: Proceedings of the Tenth IEEE International Conference on Computer Vision. Washington DC, USA: IEEE Computer Society, 2005, 2: 1308-1315.
[8] Khan S, Shah M. Object based segmentation of video using color, motion and spatial information. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA, 2001, 2: 746-751.
[9] Felzenszwalb P F, Huttenlocher D P. Efficient graph-based image segmentation. International Journal of Computer Vision, 2004, 59(2): 167-181.
[10] Comaniciu D, Meer P. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(5): 603-619.
[11] … San Diego, USA: Academic Press.

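The mutual-best 1-1 matching of Step 2 in Section 4 can be sketched as follows, assuming the edge weights between the regions of frames i and i+1 have already been collected into a matrix (integer indices stand in for the paper's R_i^k; the representation and names are illustrative).

```python
def find_one_to_one(weights):
    """Greedy mutual-best matching (Step 2 of Section 4).
    weights[k][j] is the edge weight between region k of frame i and
    region j of frame i+1, with -inf where no edge is allowed.
    Returns a list of (k, j) 1-1 correspondences."""
    n_i = len(weights)
    n_j = len(weights[0]) if weights else 0
    used_i, used_j = set(), set()
    matches = []
    while True:
        progress = False
        for k in range(n_i):
            if k in used_i:
                continue
            cand = [j for j in range(n_j) if j not in used_j]
            if not cand:
                break
            # best candidate in frame i+1 for region k
            j = max(cand, key=lambda jj: weights[k][jj])
            if weights[k][j] == float("-inf"):
                continue
            # accept only if k is also the best remaining match for j
            back = max((kk for kk in range(n_i) if kk not in used_i),
                       key=lambda kk: weights[kk][j])
            if back == k:
                matches.append((k, j))
                used_i.add(k)
                used_j.add(j)
                progress = True
        if not progress:
            return matches
```

As in the paper, matched edges are marked as used and the scan repeats until no further mutual-best pair exists; such pairs may still grow into 1-N or N-1 correspondences in later steps.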
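The final stage, where each connected sub-graph of G(V, E_i, E_o) becomes one object, is a connected-components computation over the region nodes. A minimal union-find sketch, assuming regions are numbered 0..num_regions-1 (names hypothetical):

```python
def objects_from_graph(num_regions, inner_edges, outer_edges):
    """Group region nodes into objects: each connected sub-graph of
    G(V, Ei, Eo) is one object (Section 2). Regions are integer ids;
    edges are (u, v) pairs from Ei (within a frame) and Eo (between
    neighboring frames)."""
    parent = list(range(num_regions))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(u, v):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    for u, v in list(inner_edges) + list(outer_edges):
        union(u, v)

    # collect one region list per connected sub-graph
    groups = {}
    for r in range(num_regions):
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())
```

Each returned group gathers one object's regions across frames, from which its shape, color, and behavior over time can then be derived as described at the end of Section 4.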