

This article is about data warehouse architecture: what to be guided by when building it, which approaches work, and why.

"The tale is a lie - but there is a hint in it ..."

Grandfather planted... a warehouse. And the warehouse grew big, very big. Only he did not quite know how it worked. So grandfather started a review. He called grandmother, granddaughter, the cat and the mouse to a family council and said: "Our warehouse has grown. Data flows in from all the systems, there are tables beyond counting. Users concoct their own reports. Everything seems fine - live and prosper. There is just one sorrow: nobody knows how it works. It demands disks beyond measure - you can't buy enough! And the users have taken to coming to me with complaints: either a report hangs, or the data is stale. And then there is real trouble - we bring reports to the tsar, and the numbers do not agree with each other. Any day now the tsar will get angry - and then neither my head nor yours will stay on its shoulders. So I decided to gather you together and ask: what are we going to do?"

He looked around the council and asked:
- You, grandmother, do you know how our warehouse is arranged?
- No, grandfather, I don't. And how would I? Look what stout fellows guard it - you can't get near them. I went to see them once, baked some pies. They ate the pies, wiped their moustaches and said: "Why did you come, grandma? What warehouse? Just tell us what report you need - we'll make it for you! And do bring pies more often - they're awfully tasty."
- And you, dear granddaughter, do you know how our warehouse is arranged?
- No, grandpa, I don't. They gave me access to it once. I connected, and I see tables beyond counting, scattered across different schemas. My eyes ran wide... At first I was confused. Then I looked closer - some of them are empty, others are full, but only halfway. And the data seems to be duplicated. No wonder we can't buy enough disks, with such redundancy!
- Well, cat, what do you say about our warehouse? Is there anything good in it?
- How could I not say, grandfather - I will. At my granddaughter's request I tried to make a pilot in a separate schema - a small data mart. To understand which trade is profitable for our state: which goods the merchants sell well, paying duty and replenishing the treasury, and which sell very poorly. So I began to pick data for myself out of this warehouse. I collected facts and tried to compare them by product. And what did I see, grandfather: the products seem to be the same, but look at the tables - they are different! So I started combing them with my granddaughter's comb. I combed and combed - and brought them to a certain uniformity, pleasing to the eye. But I rejoiced too early: the next day I ran my scripts to refresh this wonderful data in the mart - and everything fell apart! "How so?" I thought. The granddaughter will be upset - today we were supposed to show our pilot to the minister. How can we go with data like this?
- Sad tales you tell, cat. Well, you, little mouse - did you really not try to find out about the warehouse? You are a lively, nimble, sociable one! What will you tell us?
- How could I not try, grandfather - of course I did; I am a quiet mouse, but a nimble one. Once the cat asked me to get hold of the data model of our state warehouse - "you, mouse, are my only hope," he said. Well, why not do a good deed for good people (and cats)? I went to the castle where the head of the warehouse keeps the data model locked in a safe, hid, and waited for him to take that model out. As soon as he went out for coffee, I jumped onto the table. I look at the model - and I can't understand a thing! How so? I don't recognize our warehouse! We have countless thousands of tables and endless data streams - and here everything is harmonious and beautiful... I had a good look at that very model - and he put it back into the safe.
- Strange things indeed you have told us, mouse.
The grandfather fell into deep thought.
- What are we going to do, my friends? With a warehouse like this we will not last long... The users will soon run out of patience.

Whatever the grandfather from the fairy tale decides - to build a new warehouse or to try to resuscitate the existing one - conclusions must be drawn before "rolling up our sleeves" again.
Let's set aside the organizational aspects - such as the danger of concentrating expertise in a narrow, closed group, the absence of control processes, the lack of transparency in the architecture of the systems used in the enterprise, and so on.
Today I would like to focus on building the architecture of a specific system (or group of systems) - the data warehouse: what should be kept in focus, first of all, when an organization sets out to build such a complex and expensive system.

Debriefing

None of us, when creating and developing a system, wants it to end up a "temporary shack", or a solution that will "wither away" in a year or two because it cannot meet the requirements and expectations of the customers and the business. However strong the bias towards "agile methodologies" may be today, it is much more pleasant to feel like a craftsman who makes violins than an artisan who whittles sticks for disposable drums.
Our intention sounds natural: to build solid, high-quality systems that will not require regular "night vigils with a file", that we will not be ashamed of in front of end users, and that will not look like a "black box" to every "uninitiated" successor.

To begin with, let's jot down a list of typical problems that we regularly run into when working with warehouses. Let's simply write down what we have, without yet trying to order or formalize it.

  1. In principle, we have a good storage: if you leave it alone, then everything works. True, as soon as a change is required, “local collapses” begin.
  2. Data is loaded daily, by schedule, in one large process, within 8 hours. And that suits us. But if a failure occurs, manual intervention is required - and after that everything may work unpredictably for a long time, since a human has to take part in the process.
  3. Have rolled up the release - expect problems.
  4. Some one source could not send data on time - all processes are waiting.
  5. The integrity of the data is controlled by the database - so our processes crash when it breaks.
  6. We have a very large warehouse: 2000 tables in one common schema, and another 3000 in many other schemas. We already have little idea of how they are arranged and why they appeared. Therefore, it is often hard for us to reuse anything, and many tasks have to be solved anew, because that is easier and faster than understanding "someone else's code". As a result, we get discrepancies and duplicated functionality.
  7. We expect the sources to provide good-quality data. But it turns out that this is not the case. As a result, we spend a lot of time reconciling our final reports. And we have become very good at it - we even have a streamlined process. True, it takes time. But the users are used to it...
  8. The user does not always trust our reports and requires a justification of one figure or another. In some cases he is right, and in others he is not. But it is very difficult for us to justify them, since we have no means of "end-to-end analysis" (or data lineage).
  9. We could bring in additional developers. But we have a problem - how do we include them in the work? What is the most efficient way to parallelize jobs?
  10. How to develop the system gradually, without going into the development of the "core of the system" for a whole year?
  11. The data warehouse is associated with a corporate model. But we know for sure (we saw it at bank XYZ) that building a model can take forever (we went to bank XYZ and discussed business entities for six months without any progress). Why have it at all? Or maybe it is better to do without it, if there are so many problems with it? Maybe we can generate it somehow?
  12. We decided to maintain a model after all. But how do we systematically evolve the warehouse data model? Do we need "rules of the game", and what might they be? What will that give us? And what if we get the model wrong?
  13. Should we store the data, or the history of its changes, if "the business does not need it"? We would not want to "store garbage" and complicate the use of this data for real tasks. Should the warehouse keep history at all? What kind? How does the warehouse work with time?
  14. Should we try to unify the data on the storage if we have a master data management system? If there is MDM, does this mean that the whole problem with the master data is now solved?
  15. We are expected to replace key accounting systems soon. Does the datastore need to be ready to change source? How can this be achieved?
  16. Do we need metadata? What do we mean by this? Where exactly can they be used? How can you implement it? Do I need to store them "in one place"?
  17. Our customers are extremely unstable in their requirements and wishes - something is constantly changing. Our business in general is very dynamic: while we are building something, it already becomes unnecessary. How can we deliver results as quickly as possible - like hot pies?
  18. Users demand responsiveness. But we cannot run our main load processes often, because that loads the source systems (it hurts their performance). So we hang up additional data streams that pick up, point by point, just what we need. True, there end up being many of them, and later we will discard some of that data anyway; moreover, a reconciliation problem will arise. But there is no other way...
That is quite a lot already. And this is not a complete list - it is easy to extend. We will not hide it away in a drawer but will hang it somewhere conspicuous, keeping these issues in focus as we work.
Our task is to arrive, in the end, at a comprehensive solution.

Antifragility

Looking at our list, one conclusion can be drawn. It is not hard to create some kind of "database for reporting", load data into it, and even build some routine data update processes. The system somehow starts to live, users appear, and with them obligations and SLAs; new requirements arise, additional sources are connected, methodologies change - and all of this has to be accommodated as the system evolves.

After some time, the picture is as follows:
“Here is the vault. And it works if you don't touch it. Problems arise when we have to change something. "

A change arrives whose impact we are unable to assess or comprehend (because we did not build such instruments into the system from the start) - and, in order not to take risks, we do not touch what exists but bolt on yet another extension on the side, and another, and another - turning our solution into slums, or, as they say in Latin America, "favelas", where even the police are afraid to enter.
There is a feeling of losing control over one's own system, of chaos. It takes more and more hands to maintain the existing processes and solve problems, and changes become harder and harder to make. In other words, the system becomes unstable under stress, maladapted to change. On top of that, there is a strong dependence on the characters who "know the fairway", since no one else has a "map".

This property of an object - to break down under the influence of chaos, random events and shocks - Nassim Nicholas Taleb calls fragility. He also introduces the opposite concept: antifragility, when the object does not break down from stress and accidents but benefits from them directly ("Antifragile: Things That Gain from Disorder").
Otherwise it can be called adaptability, or resilience to change.

What does this mean in our context? What are the "sources of chaos" for IT systems? And what does it mean to "benefit from chaos" in terms of IT architecture?
The first thought that comes to mind is the changes that come from outside. What is the outside world for the system - for the warehouse in particular? First of all, of course, changes on the side of the data sources feeding the warehouse:

  • changing the formats of incoming data;
  • replacement of some data source systems with others;
  • change of rules / platforms for systems integration;
  • changing the interpretation of data (formats are saved, the logic of working with data changes);
  • changing the data model if the integration is done at the data level (parsing the database transaction log files);
  • growth in data volumes - while there was little data in the source system and the load was low, the data could be retrieved at any time with an arbitrarily heavy query; once the data and the load have grown, there are strict restrictions;
  • etc.
The source systems themselves can change: the composition of information and its structure, the type of integration interaction, and the very logic of working with the data. Each system implements its own data model and its own approaches to working with it, which answer that system's goals and objectives. And however hard people try to unify industry models and reference practices, nuances will inevitably surface. (Besides, the process of industry unification itself, for various reasons, is not advancing very far.)
A culture of working with corporate data - the presence and governance of an information architecture, a unified semantic model, master data management (MDM) systems - somewhat eases the task of consolidating data in the warehouse, but does not remove the need for it.

No less critical changes are initiated by the warehouse consumers (requirements change):

  • previously, the available data was enough to build a report - now additional fields or a new data source need to be connected;
  • previously implemented data processing methods have become outdated - the algorithms, and everything they affect, need to be reworked;
  • previously everyone was satisfied with the current value of a dictionary attribute on the dashboard - now the value that was relevant at the moment of the analyzed fact/event is required;
  • there was a requirement for the depth of the data storage history, which did not exist before - to store data not for 2 years, but for 10 years;
  • previously, there was enough data as of the “end of the day / period” - now you need the state of the data “within the day”, or at the time of a certain event (for example, a decision on a loan application - for Basel II);
  • earlier we were satisfied with reporting on data for yesterday (T-1) or later, now we need T0;
  • etc.
Both the integration interactions with the source systems and the requirements from the consumers of the warehouse data are external factors for the data warehouse: some source systems replace others, data volumes grow, the formats of incoming data change, user requirements change, etc. And all these are typical external changes for which our system - our repository - must be ready. With the right architecture, they shouldn't kill the system.

But that is not all.
Speaking about variability, we first of all recall external factors. After all, inside we control everything - or so it seems to us, right? Yes and no. Yes, most of the factors that lie outside our zone of influence are external. But there is also "internal entropy". And precisely because of its presence we sometimes need to return "to point zero" and start the game over.
In life we often tend to start from scratch. Why is this so characteristic of us? And is it really so bad?
Applied to IT, for the system itself this can be very good - the ability to rethink individual decisions, especially when we can do it locally. Refactoring is the process of untangling the "web" that periodically appears as the system develops. Going back to the beginning can be useful. But it has a price.
With competent architecture management, this price decreases, and the system development process itself becomes more controllable and transparent. A simple example: if the principle of modularity is observed, a separate module can be rewritten without affecting its external interfaces. With a monolithic structure this cannot be done.
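As a minimal illustration of that modularity argument (a sketch only - the interface and class names here are hypothetical, not something from the article): callers depend on a stable interface, so the implementation behind it can be rewritten without touching them.

```python
from typing import Protocol, Iterable


class CustomerSource(Protocol):
    """Stable external interface of the module: callers depend only on this."""

    def fetch_changed_customers(self, since: str) -> Iterable[dict]:
        ...


class JdbcCustomerSource:
    """One implementation. It can be rewritten or replaced entirely
    (e.g. switched to file exchange) without touching the callers,
    because they rely only on the CustomerSource protocol."""

    def fetch_changed_customers(self, since: str) -> Iterable[dict]:
        # hypothetical extraction logic
        return [{"id": 1, "name": "ACME", "changed_at": since}]


def load_customers(source: CustomerSource) -> int:
    rows = list(source.fetch_changed_customers(since="2024-01-01"))
    return len(rows)


print(load_customers(JdbcCustomerSource()))
```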

The antifragility of a system is determined by the architecture embedded in it. And it is this property that makes the system adaptive.
When we speak of adaptive architecture, we mean that the system is able to adapt to changes - not at all that we constantly change the architecture itself. On the contrary, the more stable the architecture, the fewer requirements entail its revision, and the more adaptive the system.

Solutions that involve revising the entire architecture have a much higher price, and very good reasons are needed to adopt them. Such a reason might be, for example, a requirement that cannot be implemented within the existing architecture. Then people say: a requirement has appeared that affects the architecture.
Thus, we also need to know our "limits of antifragility". Architecture is not developed "in a vacuum" - it rests on current requirements and expectations. And if the situation changes fundamentally, we must understand that we have gone beyond the current architecture, that it has to be revised, a different solution worked out, and transition paths thought through.
For example, we assumed that the warehouse would always need data as of the end of the day, and that we would take the data every day via standard system interfaces (through a set of views). Then the risk management department came with a demand to receive data not at the end of the day but at the moment the lending decision is made. There is no point in trying to "stretch the unstretchable" - we just need to acknowledge this fact, the sooner the better, and start working on an approach that will solve the problem.
There is a very fine line here: if we take into account only the "requirements of the moment" and do not look several steps ahead (and several years ahead), we increase the risk of running into an architecture-affecting requirement too late - and the price of the change will be very high. Looking a little ahead, within the boundaries of our horizon, has never hurt anyone.

The system from the "warehouse tale" is just one example of a very shaky system built on fragile design approaches. When that happens, destruction comes rather quickly - for this particular class of systems.
Why can I say so? The topic of warehouses is not new. The approaches and engineering practices developed over this time were aimed precisely at maintaining the viability of the system.
A simple example: one of the most common reasons data warehouse projects fail at take-off is the attempt to build the warehouse on top of source systems still under development, without agreeing on integration interfaces - trying to fetch data directly from their tables. As a result the project goes into development, the source database changes during that time, and the load streams in the warehouse become inoperative. It is too late to redo anything, and if you have not protected yourself by building several layers of tables inside the warehouse, you may have to throw everything out and start over. This is just one example, and one of the simpler ones.
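To make the "agreed interface" idea concrete, here is a minimal sketch using a view as the integration contract (all table, view and column names are hypothetical; sqlite3 is used only so the example runs standalone). The warehouse extracts through the view; when the source restructures its internal tables, only the view definition is re-pointed, and the extraction query survives.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# "Source system" owns its internal tables and may restructure them at will.
con.execute("CREATE TABLE acc_v2 (acc_id INTEGER, acc_no TEXT, bal_cents INTEGER)")
con.execute("INSERT INTO acc_v2 VALUES (1, 'ACC-001', 150000)")

# The agreed integration interface is a view with a fixed contract.
# When acc_v2 is refactored, the source team re-points this view;
# the warehouse extraction query below does not change.
con.execute("""
    CREATE VIEW if_accounts AS
    SELECT acc_id             AS account_id,
           acc_no             AS account_number,
           bal_cents / 100.0  AS balance
    FROM acc_v2
""")

# Warehouse side: extraction is written against the stable view only.
for row in con.execute("SELECT account_id, account_number, balance FROM if_accounts"):
    print(row)
```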

Taleb's criterion for fragile versus antifragile is simple. The main judge is time. If a system withstands the test of time and shows its "vitality" and "indestructibility", it has the property of antifragility.
If, when designing a system, we take antifragility into account as a requirement, this will push us towards approaches to building its architecture that make the system more adaptive both to "chaos from outside" and to "chaos from within". And ultimately the system will have a longer life.
None of us wants to build "temporary shacks". And do not deceive yourself that there is no other way today. Looking a few steps ahead is normal for a person at any time, especially in a crisis.

What is a data warehouse and why are we building it

An article on warehouse architecture assumes that the reader not only knows what a warehouse is, but also has some experience with such systems. Nevertheless, I thought it necessary to return to the origins, to the beginning of the path, because that is where the "fulcrum" of development lies.

How did people come to the idea that data warehouses are needed? And how do they differ from just a "very large database"?
A long time ago, when there were simply "business data processing systems" in the world, there was no division of IT systems into classes such as front-end OLTP systems, back-office DSS, text processing systems, data warehouses, and so on.
That was the time when Michael Stonebraker created one of the first relational database engines, Ingres.
And that was the time when the era of personal computers burst into the computer industry like a whirlwind and forever changed the ideas of the IT community of those years.

At that time it was easy to find enterprise applications written on desktop-class DBMSs such as Clipper, dBase and FoxPro. The market for client-server applications and DBMSs was only gaining momentum. Database servers appeared one after another that would occupy their niche in the IT space for a long time - Oracle, DB2 and others.
The term "database application" was in common use. What did such an application include? Simplified: some input forms through which users could enter information concurrently, some calculations launched "by button" or "on schedule", and some reports that could be viewed on screen or saved as files and sent to print.
"Nothing special - just a regular application, just a database," one of my mentors remarked early in my career. "So, nothing special?" I thought at the time.

If you look closely, there are some peculiarities after all. As the number of users grows and the volume of incoming information increases, the load on the system rises, and its developers and designers resort to certain "tricks" to keep performance at an acceptable level. The very first is splitting the monolithic "business data processing system" into an accounting application that supports the users' online work and a separate application for batch data processing and reporting. Each of these applications has its own database and is even hosted on a separate instance of the database server, with settings tuned for its type of load - OLTP and DSS. And data streams are built between them.

Is that all? It would seem the problem is solved. What happens next?
Then companies grow and their information needs multiply. The number of interactions with the outside world also grows. As a result, there is not one large application that fully automates all processes, but several different ones from different vendors. The number of systems generating information - data source systems - in the company keeps increasing. And sooner or later there is a need to see and compare information obtained from different systems. This is how data warehouses appear in a company - a new class of systems.
The generally accepted definition of this class of systems is as follows.

A data warehouse is a subject-oriented information database specially designed and built for preparing reports and for business analysis, with the aim of supporting decision-making in an organization.
Thus, consolidation of data from different systems and the ability to look at it in a certain "unified" way is one of the key properties of systems of the data warehouse class. This is the reason warehouses emerged in the course of the evolution of IT systems.

Key features of data warehouses

Let's take a closer look. What are the key features of these systems? What makes data warehouses different from other enterprise IT systems?

First, these are large volumes. Very large. VLDB (Very Large Database) is what leading vendors call such systems when they give recommendations on using their products. Data flows into this large database from all the company's systems and is stored there "forever and unchanged", as the textbooks say (in practice, life turns out to be more complicated).

Second, this is historical data. "Corporate memory" - that is what data warehouses are sometimes called. Working with time in a warehouse is quite interesting. In accounting systems the data is up to date at this moment; then the user performs some operation, and the data is updated. The history of changes may or may not be kept - that depends on the accounting practice. Take a bank account balance, for example. We may be interested in the current balance "now", at the end of the day, or at the moment of some event (for example, at the moment a score was calculated). While the first two are fairly easy to obtain, the latter will most likely require special effort. A user working with the warehouse can refer to past periods, compare them with the current one, and so on. It is these time-related capabilities that significantly distinguish data warehouses from accounting systems: obtaining the state of the data at various points on the time axis, to a certain depth in the past.
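A minimal sketch of how such "state at a point in time" is usually made possible - history kept as validity intervals (a simplified SCD2-style structure; the field names and values are illustrative, not from the article):

```python
from datetime import date

# Balance history kept as validity intervals: each row says when the value was true.
balance_history = [
    {"account": "A-1", "balance": 100, "valid_from": date(2024, 1, 1),  "valid_to": date(2024, 1, 10)},
    {"account": "A-1", "balance": 250, "valid_from": date(2024, 1, 10), "valid_to": date(9999, 12, 31)},
]


def balance_as_of(account: str, as_of: date):
    """Return the balance that was valid at the given point in time."""
    for row in balance_history:
        if row["account"] == account and row["valid_from"] <= as_of < row["valid_to"]:
            return row["balance"]
    return None


print(balance_as_of("A-1", date(2024, 1, 5)))  # 100 - the state in the past
print(balance_as_of("A-1", date(2024, 2, 1)))  # 250 - the state after the change
```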

Third, this is consolidation and unification of data. For joint analysis to become possible, the data must be brought to a common form - a unified data model - and facts must be mapped to unified reference books. There can be several aspects and difficulties here. First of all, conceptual ones: different people from different departments may understand different things under the same term, and conversely, call essentially the same thing by different names. How do we provide a "single view" while preserving the specific vision of a particular user group?
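In the simplest case, unification boils down to mapping local source keys to the keys of a unified reference book. A sketch under assumed names (the systems, codes and mapping table here are hypothetical):

```python
# Local product codes from two source systems mapped onto one unified reference book.
unified_products = {"P-001": "Chocolate cake"}

source_to_unified = {
    ("CRM",     "cake_choco"): "P-001",
    ("BILLING", "1001-CHC"):   "P-001",
}


def unify(source_system: str, local_code: str) -> str:
    """Replace a local source key with the unified warehouse key."""
    try:
        return source_to_unified[(source_system, local_code)]
    except KeyError:
        raise ValueError(f"No mapping for {source_system}/{local_code}: route to data stewards")


fact = {"source": "CRM", "product_code": "cake_choco", "amount": 42}
fact["product_key"] = unify(fact["source"], fact["product_code"])
print(fact["product_key"])  # both sources now point to the same P-001
```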

Fourth, this is work with data quality. In the process of loading data into the warehouse, it is cleaned and general transformations are performed. General transformations should be done in one place and then used to build the various reports. This avoids the inconsistencies that so annoy business users - especially executives who are brought numbers from different departments that do not agree with each other. Poor data quality produces errors and discrepancies in reports, and the consequence is a drop in user trust in the entire system and in the analytical service as a whole.
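What a basic data-quality check at load time can look like - a hedged sketch, with made-up rules and field names; real rule sets are agreed with the source and the business:

```python
import re


def check_row(row: dict) -> list:
    """Return a list of data-quality violations found in one incoming record."""
    errors = []
    if not row.get("customer_id"):
        errors.append("customer_id is empty")
    if row.get("amount") is not None and row["amount"] < 0:
        errors.append("amount is negative")
    if row.get("phone") and not re.fullmatch(r"\+?\d{10,15}", row["phone"]):
        errors.append("phone has invalid format")
    return errors


rows = [
    {"customer_id": "C-1", "amount": 10.5, "phone": "+79001234567"},
    {"customer_id": "",    "amount": -3,   "phone": "abc"},
]

for r in rows:
    issues = check_row(r)
    # Depending on severity, the row is loaded, quarantined, or the process stops.
    print(r["customer_id"] or "<missing>", "->", issues or "OK")
```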

Architectural concept

Anyone who has come across a repository has most likely observed some kind of "layered structure" - since it is this architectural paradigm that has taken root for systems of this class. And it is no coincidence. Storage layers can be perceived as separate components of the system - with their own tasks, area of ​​responsibility, "rules of the game."
Layered architecture is a means of dealing with the complexity of the system - each subsequent level is abstracted from the complexities of the internal implementation of the previous one. This approach allows you to single out tasks of the same type and solve them in a uniform way, without re-inventing the “wheel” from scratch every time.
The conceptual architectural diagram is schematically shown in the figure. This is a simplified diagram that reflects only the key idea - the concept, but without the "anatomical details" that would arise with deeper elaboration of details.

As shown in the diagram, the following layers are conceptually distinguished: three main layers, each containing a data storage area (indicated by a filled rectangle) and the data loading software (conventionally indicated by arrows of the same color), plus an auxiliary service layer, which nevertheless plays a very important connecting role - load management and quality control.

Primary Data Layer - the primary data layer (also called staging or the operational layer) - is designed for loading data from source systems and saving the primary information without transformations, in its original quality, with a complete history of changes preserved.
The task of this layer is to abstract the subsequent layers of the warehouse from the physical structure of the data sources, the methods of data collection and the methods of extracting the delta of changes.

Core Data Layer - the core of the warehouse - is the central component of the system, the one that distinguishes a warehouse from a mere "batch integration platform" or "big data dump": its main role is the consolidation of data from different sources and its reduction to unified structures and keys. It is when loading into the core that the main work on data quality and the general transformations, which can be quite complex, is performed.
The task of this layer is to abstract its consumers from the peculiarities of the logical structure of the data sources and from the need to reconcile data from different systems, and to ensure the integrity and quality of the data.

Data Mart Layer - analytical data marts - is the component whose main function is to convert data into structures convenient for analysis (if a BI tool works with the marts, this is, as a rule, a dimensional model), or according to the requirements of a consuming system.
As a rule, data marts take data from the core - as a reliable and verified source - i.e. they use the service of this component to bring the data into a single form. We will call such marts regular. In some cases marts can take data directly from staging, operating on primary data (in the source keys). This approach is usually used for local tasks where consolidation of data from different systems is not required and where speed matters more than data quality; such marts are called operational. Some analytical indicators have very complex calculation methods, so for such non-trivial calculations and transformations so-called secondary marts are created.
The task of the mart layer is to prepare data according to the requirements of a specific consumer - a BI platform, a group of users, or an external system.

The layers described above consist of a persistent data storage area and a software module for loading and transforming the data. The division into layers and areas is logical; physically these components can be implemented differently - you can even use different platforms for storing or transforming data on different layers, if that is more efficient.
The storage areas contain technical tables (buffer tables), which are used in the process of data transformation, and target tables, which the consuming component addresses. It is good practice to "cover" the target tables with views - this eases the subsequent maintenance and development of the system. The data in the target tables of all three layers is marked with special technical fields (meta-attributes), which serve to support the data loading processes and to enable an informational audit of the data flows in the warehouse.
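What such meta-attributes might look like in practice - a sketch only; the specific field names (meta_process_id and so on) are my own illustration, not a standard dictated by the article:

```python
from datetime import datetime, timezone


def with_meta_attributes(row: dict, process_id: int, source_system: str) -> dict:
    """Attach the technical fields that every target-table row carries."""
    return {
        **row,
        "meta_process_id": process_id,        # which load process wrote the row
        "meta_load_ts": datetime.now(timezone.utc).isoformat(),  # when it was loaded
        "meta_source_system": source_system,  # where the record came from
        "meta_is_deleted": False,             # soft-delete flag instead of a physical delete
    }


print(with_meta_attributes({"account_id": 1, "balance": 100.0},
                           process_id=42, source_system="CORE_BANKING"))
```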

A special component (or set of components) is also singled out that provides service functions for all layers. One of its key tasks is the controlling function: to provide "uniform rules of the game" for the whole system, while leaving the freedom to use different implementation options for each of the layers described above - including different technologies for loading and processing data, different storage platforms, and so on. Let's call it the service layer. It contains no business data, but it has its own storage structures: a metadata area and an area for working with data quality (and possibly other structures, depending on the functions assigned to it).

Such a clear division of the system into separate components significantly increases the controllability of the development of the system:

  • the complexity of the task posed to the developer of a given component's functionality is reduced (they do not have to simultaneously solve integration with external systems, think through data cleansing procedures and work out the optimal presentation of data for consumers) - the task is easier to decompose, estimate and deliver in small increments;
  • you can connect to the work of various performers (and even teams, or contractors) - because this approach allows you to effectively parallelize tasks, reducing their mutual influence on each other;
  • the presence of persistent staging allows you to quickly connect data sources without designing the entire core or storefronts for the entire subject area, and then gradually finish building the remaining layers according to priorities (moreover, the data will already be in the storage - available to system analysts, which will greatly facilitate the tasks of the subsequent development of the storage);
  • the presence of a core allows all work with data quality (as well as possible mistakes and errors) to be hidden from storefronts and from the end user, and most importantly - using this component as a single data source for storefronts, you can avoid data convergence problems due to the implementation of common algorithms in one place;
  • the selection of marts allows you to take into account the differences and the specifics of understanding the data that users of different departments may have, and their design for BI requirements allows not only to issue aggregated figures, but to ensure data validation by providing opportunities to drill down to primary indicators;
  • the presence of a service layer allows you to perform end-to-end data analysis (data lineage), use unified data audit tools, general approaches to highlighting the delta of changes, work with data quality, load management, monitoring and error diagnostics tools, and accelerates problem resolution.
This approach to decomposition also makes the system more resistant to change (in comparison with the "monolithic structure") - it ensures its antifragility:
  • changes on the part of source systems are processed at staging - in the kernel, only those streams that are influenced by these staging tables are modified, the effect on storefronts is minimal or absent;
  • changes in requirements on the part of consumers are processed for the most part in storefronts (if this does not require additional information that is not yet in the store).
Next, we will go through each of the components presented above and take a look at them in a little more detail.

System core

Let's start from the middle - the core of the system or the middle layer. It is labeled as Core Layer. The kernel plays the role of data consolidation - bringing to uniform structures, reference books, keys. This is where the main work with data quality is carried out - cleaning, transformation, unification.

The presence of this component makes it possible to reuse the data streams that transform the primary data received from source systems into a unified format, following common rules and algorithms, rather than repeating the implementation of the same functionality separately for each application mart - which, besides inefficient use of resources, would also entail discrepancies in the data.
The core of the warehouse is implemented in a data model that, in the general case, differs both from the models of the source systems and from the formats and structures of the consumers.

Warehouse Core Model and Enterprise Data Model

The main concern of the middle warehouse layer is stability. That is why the main focus here is on the data model. It is commonly referred to as the "corporate data model". Unfortunately, a kind of aura of myths and absurdities has formed around it, which sometimes leads to refusing to build it at all - and in vain.

Myth 1. An enterprise data model is a huge model with thousands of entities (tables).
Actually. In any subject area, in any business domain, in the data of any company, even the most complex, there are few basic entities - 20-30.

Myth 2. There is no need to develop any "own model" - we buy an industry reference model - and we do everything according to it. We spend money - but we get a guaranteed result.
Actually. Reference models can indeed be very useful, because they contain industry experience in modeling the given area. From them you can glean ideas, approaches and naming practices, and check the "depth of coverage" of the area so that nothing important is overlooked. But we are unlikely to be able to use such a model out of the box, as it is. This is the same myth as, say, buying an ERP (or CRM) system and implementing it without any "tailoring to yourself". The value of such models is born in their adaptation to the realities of this particular business, this particular company.

Myth 3. The development of a core repository model can take many months, during which time the project will actually be frozen. Plus, it requires an insane amount of meetings and lots of people.
Actually. The warehouse model can be developed together with the warehouse, iteratively, piece by piece. For uncovered areas, "expansion points" or "stubs" are set, i.e. some "universal constructs" are applied. At the same time, you need to know when to stop, so as not to end up with a super-universal thing of 4 tables into which it is difficult to "put data" and (even harder) to get it out, and which is extremely suboptimal in terms of performance.

Developing the model really does take time. But this is not time spent on "drawing entities" - it is the time required to analyze the subject area and understand how the data is arranged. That is why analysts are very closely involved in this process, as are various business experts. And this is done pointwise, selectively - not by organizing meetings with an insane number of participants, sending out huge questionnaires, and so on.
Good business and systems analysis is key to building the core warehouse model. There is a lot to understand: where (in which systems) the data is generated, how it is arranged, in which business processes it circulates, and so on. Good analysis has never harmed a single system; rather, problems come from the "blank spots" in our understanding.

Developing a data model is not a process of inventing something new. In fact, the data model already exists in the company, and the design process is more like "excavation": the model is carefully extracted from the "soil" of corporate data and put into a structured form.

Myth 4. In our company the business is so dynamic, and everything changes so quickly, that it is useless for us to make a model - it will become outdated before we even put this part of the system into operation.
Actually. Recall that the key factor for the core is stability - above all, the stability of the model's topology. Why? Because this component is central and influences everything else. Stability is a requirement for the core model too. If a model becomes outdated too quickly, it was designed incorrectly: the wrong approaches and "rules of the game" were chosen for its development. It is also a matter of the quality of analysis. The key entities of a corporate model change rarely.
But if, for a company that sells, say, confectionery, it occurs to us to create "Sweets", "Cakes" and "Pies" instead of a single "Products" reference book - then when pizza appears in the list of goods, yes, many new tables will have to be added. And that is purely a matter of approach.

Myth 5. The creation of a corporate model is a very serious, complex and responsible business. And it's scary to make a mistake.
Actually. The core model, although it should be stable, is not "cast in metal". Like any other design decision, its structure can be revised and modified - you just must not forget about this quality of it. But this does not mean at all that "you cannot breathe on it", nor does it mean that temporary solutions and "stubs", which should be planned for rework, are unacceptable.

Myth 6. If our data source is, for example, a reference data system (or a master data management system - MDM), then, in a good way, it should already correspond to the corporate model (especially if it was designed recently and has not had time to acquire "side growths", "traditions" and temporary shacks). It turns out that in this case we do not need a core model at all?
Actually. Yes, in this case building the core model of the warehouse is greatly eased, since we follow a ready-made top-level conceptual model. But it is not eliminated. Why? Because when building the model of a particular system, certain rules of its own apply: which types of tables to use (for each entity), how to version the data, at what granularity to keep history, which meta-attributes (technical fields) to use, and so on.

In addition, however wonderful and all-embracing our reference data and MDM system may be, nuances will, as a rule, arise from the existence of local directories "about the same thing" in other accounting systems. And this problem, whether we like it or not, will have to be solved in the warehouse - after all, this is where reporting and analytics are assembled.

Primary data layer (or historicized staging or operational layer)

In the diagram it is designated as the Primary Data Layer. The role of this component is integration with source systems, loading and storing the primary data, and preliminary data cleansing: checking compliance with the format-and-logic control rules fixed in the "interface interaction agreement" with the source.
In addition, this component solves a task that is very important for the warehouse - extracting the "true delta of changes", regardless of whether the source itself allows changes in the data to be tracked, and how (by what criterion they can be "caught"). As soon as the data has landed in staging, the question of delta extraction is already settled for all the other layers, thanks to the marking with meta-attributes.
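One common way to extract a "true delta" when the source gives no change markers at all is to compare hashes of incoming rows with what is already stored. A sketch under assumed names - a real implementation would work against the staging tables rather than in-memory lists:

```python
import hashlib
import json


def row_hash(row: dict, business_key: str) -> tuple:
    """Hash of all non-key attributes, used to detect changed rows."""
    payload = {k: v for k, v in row.items() if k != business_key}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return row[business_key], digest


def allocate_delta(incoming: list, stored: list, key: str = "id"):
    """Split a full extract into new and changed rows; everything else is unchanged."""
    old = dict(row_hash(r, key) for r in stored)
    inserts, updates = [], []
    for r in incoming:
        k, h = row_hash(r, key)
        if k not in old:
            inserts.append(r)
        elif old[k] != h:
            updates.append(r)
    return inserts, updates


stored   = [{"id": 1, "name": "ACME", "city": "Riga"}]
incoming = [{"id": 1, "name": "ACME", "city": "Tallinn"},    # changed
            {"id": 2, "name": "Globex", "city": "Vilnius"}]  # new

print(allocate_delta(incoming, stored))
```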

The data in this layer is stored in structures as close as possible to the source system, in order to keep the primary data as close as possible to its original form. Another name for this component is the "operational layer".
Why not simply use the well-established term "staging"? The fact is that earlier, before the "era of big data and VLDB", disk space was very expensive, and the primary data, if kept at all, was kept only for a limited period of time; the name "staging" is therefore often applied to a cleanable buffer.
Now technology has moved forward, and we can afford not only to store all the primary data but to historicize it at whatever granularity is possible. This does not mean that we should not control the growth of data, and it does not remove the need to manage the information lifecycle, optimizing the cost of storage depending on the "temperature" of use - i.e. moving "cold", less-demanded data to cheaper media and storage platforms.

What does the presence of "historicized staging" give us:

  • the possibility of making mistakes (in structures, in transformation algorithms, in the granularity of history) - having fully historicized primary data within reach of the warehouse, we can always reload our tables;
  • the opportunity to think - we can take our time working out a large fragment of the core in this particular iteration of warehouse development, because the data will in any case already be in our staging, with a uniform time horizon (there will be a single point of "history reference");
  • the possibility of analysis - we keep even data that is no longer in the source (it may have been overwritten there, sent to an archive, etc.) - with us it remains available for analysis;
  • the possibility of an information audit - thanks to the most detailed primary information, we can later figure out how the load worked and why we ended up with exactly these figures (for this we also need marking with meta-attributes and the corresponding metadata the load runs on - this is handled by the service layer).
What difficulties can arise when building a "historicized staging":
  • it would be convenient to impose transactional integrity requirements on this layer, but practice shows that this is hard to achieve (which means that in this area we do not guarantee the referential integrity of parent and child tables) - integrity is restored on subsequent layers;
  • this layer holds very large volumes (the largest in the warehouse, despite all the redundancy of the analytical structures) - and you need to be able to handle such volumes, both in loading and in querying (otherwise you can seriously degrade the performance of the whole warehouse).
What else is worth saying about this layer?
Firstly, if we move away from the paradigm of "end-to-end loading processes", the rule "the caravan moves at the speed of the last camel" no longer applies: we abandon the "caravan" principle and switch to the "conveyor" principle - take the data from the source, put it into your layer, and be ready to take the next portion. This means that
1) we do not wait for processing to happen on the other layers;
2) we do not depend on the schedule by which other systems provide their data.
Simply put, we schedule a loading process that takes data from one source through a specific way of connecting to it, performs the checks, extracts the delta - and puts the data into the target staging tables. And that's all.

Secondly, these processes, as you can see, are very simple - trivially simple, from the point of view of logic. This means they can be very well optimized and parameterized, reducing the load on our system and speeding up the connection of new sources (development time).
For this to happen, you need to know the technological peculiarities of the platform on which this component runs very well - and then you can make a very effective tool.
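A rough idea of what "parameterized" means here: one generic routine, and each feed reduced to a set of parameters. The sketch below is deliberately simplified and all names (StagingFeed, the feed list, the steps) are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StagingFeed:
    """Everything one staging load needs to know, reduced to parameters."""
    source_system: str
    target_table: str
    extract: Callable[[], list]        # how to connect to the source and read it
    delta_mode: str = "full_compare"   # or "by_timestamp", "by_log", ...


def load_feed(feed: StagingFeed, process_id: int) -> int:
    rows = feed.extract()                       # 1. take data from the source
    # 2. format-logical checks and delta extraction would go here
    for row in rows:                            # 3. write to the target staging table
        row["meta_process_id"] = process_id
    print(f"{feed.target_table}: loaded {len(rows)} rows from {feed.source_system}")
    return len(rows)


# The same routine serves any number of feeds; only the parameters differ.
feeds = [
    StagingFeed("CRM",  "stg_crm_customers", lambda: [{"id": 1}]),
    StagingFeed("BILL", "stg_bill_invoices", lambda: [{"id": 7}, {"id": 8}]),
]
for i, f in enumerate(feeds, start=100):
    load_feed(f, process_id=i)
```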

Showcase layer

The Data Mart Layer is responsible for preparing and providing data to end users - people or systems. At this level, the requirements of the consumer are taken into account as much as possible - both logical (conceptual) and physical. The service should provide exactly what is needed - no more, no less.

If the consumer is an external system then, as a rule, it dictates the data structures it needs and the rules for picking up the information. A good approach is one in which the consumer is responsible for correctly collecting the data: the data warehouse prepares and populates the mart and provides the possibility of incremental pickup (marking with meta-attributes for subsequent extraction of the delta of changes), and the consuming system then controls, and is responsible for, how it uses this mart. There are some peculiarities: when the consuming system has no active component for data collection, either an external component is needed that will perform the integrating function, or the warehouse itself acts as an "integration platform" and ensures the correct incremental delivery of data further, outside the warehouse. Many nuances emerge here, and the rules of interface interaction must be thought through and understood by both sides (as always when it comes to integration). As a rule, routine cleansing/archiving is applied to such marts (it is rarely necessary for this "transit data" to be stored for a long time).

The most important, from the point of view of analytical tasks, are the marts "for people" - more precisely, for the BI tools they work with.
There is, however, a category of "especially advanced users" - analysts, data scientists - who need neither BI tools nor scheduled processes for populating external specialized systems. They need some kind of "common marts" and "their own sandbox", where they can create tables and transformations at their discretion. In this case the warehouse's responsibility is to ensure that these common marts are filled with data according to schedule.
Separately, we can single out consumers such as Data Mining tools - deep data analysis. They have their own data preparation requirements, and data scientists work with them too. For the warehouse, the task again comes down to supporting a service for loading certain marts in an agreed format.

But let us return to the analytical marts. It is these that interest warehouse designers in this data layer.
In my opinion, the best time-tested approach to designing data marts, the one practically all BI platforms are now "tuned" for, is Ralph Kimball's approach, known as dimensional modeling. There are many publications on the topic: the basic rules can be found in the publications of Kimball himself, the guru of multidimensional modeling, and his "Kimball's Tips" are another helpful resource.
The multidimensional approach to building marts has been described and elaborated so well - both by the "evangelists of the method" and by the leading software vendors - that there is no point dwelling on it in detail here: the original source is always preferable.

I would like to place just one emphasis. "Reporting and analytics" comes in different forms. There is "heavy reporting" - pre-ordered reports generated as files and delivered to users through the provided delivery channels. And there are dashboards - BI panels. At their core, these are web applications, and the response time expected of them is the same as for any other web application: the normal refresh time for a BI panel is seconds, not minutes. It is important to keep this in mind when designing the solution. How can this be achieved? The standard optimization method: we look at what the response time is made of and what we can influence. Where is most of the time spent? On physical (disk) reads in the database and on transferring data over the network. How do we reduce the volume of data read and transferred per request? The answer is obvious and simple: either aggregate the data, or apply a filter to the large fact tables participating in the query, and exclude joins between large tables (references to fact tables should go only through dimensions).
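As one possible illustration of the "aggregate it" option - a secondary, pre-aggregated mart that the dashboard's opening screen reads instead of scanning the detailed fact table. A sketch with made-up tables and data, using sqlite3 only so the SQL can be run as-is:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_key INTEGER, sale_date TEXT, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Cakes'), (2, 'Pies');
    INSERT INTO fact_sales  VALUES (1, '2024-01-01', 10), (1, '2024-01-02', 20),
                                   (2, '2024-01-01', 5);

    -- Secondary aggregated mart: small, cheap to read, refreshed by the load process.
    CREATE TABLE agg_sales_by_category AS
    SELECT d.category, f.sale_date, SUM(f.amount) AS amount
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category, f.sale_date;
""")

for row in con.execute("SELECT * FROM agg_sales_by_category ORDER BY category, sale_date"):
    print(row)
```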

What is BI for? Why is it convenient? Why is the multidimensional model effective?
BI allows the user to run so-called ad hoc queries. What does that mean? It means that we do not know the exact query in advance, but we do know which indicators, in which dimensions, the user may request. The user forms such a query by selecting the appropriate BI filters. The task of the BI developer and the mart designer is to provide application logic such that the data is either filtered or aggregated, preventing the situation where too much data is requested and the application "hangs". Usually, one starts with aggregated figures and then drills down into more detailed data, applying the necessary filters along the way.

It is not always enough to just build the “right star” and get a convenient structure for BI. Sometimes you will need to apply denormalization somewhere (while looking back at how this will affect the load), and somewhere to make secondary storefronts and aggregates. Add indexes or projections somewhere (depending on the DBMS).

Thus, through "trial and error", you can get a structure that is optimal for BI - which will take into account the peculiarities of both the DBMS and the BI platform, as well as the user's requirements for data presentation.
If we take data from the "core", then such processing of storefronts will be of a local nature, without affecting in any way the complex processing of primary data obtained directly from source systems - we only "shift" the data into a format convenient for BI. And we can afford to do this many times, in different ways, in accordance with different requirements. It is much easier and faster to do this on kernel data than to collect from the "primary" (the structure and rules of which, as we know, can also "float").

Service layer

The Service Layer is responsible for the implementation of general (service) functions that can be used to process data in various storage layers - load management, data quality management, problem diagnosis and monitoring tools, etc.
The presence of this level provides transparency and structured data flows in the storage.

This layer includes two data storage areas:

  • metadata area - used for the data loading control mechanism;
  • data quality area - for the implementation of off-line data quality checks (i.e. those that are not directly built into ETL processes).
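A minimal sketch of what the metadata area just mentioned might contain - a process-run registry and checkpoints. The table and column names here are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

meta = sqlite3.connect(":memory:")
meta.executescript("""
    -- Registry of load processes and their runs.
    CREATE TABLE meta_process_run (
        process_id   INTEGER,
        run_id       INTEGER,
        target_table TEXT,
        status       TEXT,      -- RUNNING / SUCCESS / FAILED
        started_at   TEXT,
        finished_at  TEXT
    );

    -- Checkpoints: up to which point each process has already read its source.
    CREATE TABLE meta_checkpoint (
        process_id     INTEGER,
        source_object  TEXT,
        last_read_key  TEXT     -- timestamp, sequence number, log position, ...
    );
""")

meta.execute("INSERT INTO meta_checkpoint VALUES (42, 'CRM.customers', '2024-01-31 23:59:59')")
print(meta.execute("SELECT * FROM meta_checkpoint").fetchall())
```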
The load management process can be organized in different ways. One possible approach is this: we split the entire set of warehouse tables into modules (a module may include tables of only one layer). The tables of each module are loaded by a separate process; let's call it a control process. The control process starts on its own schedule, orchestrates the calls of atomic processes - each of which loads one target table - and also contains some common steps.
Obviously, it is easy to divide the staging tables into modules - by source system, or rather by connection point. But for the core this is harder to do, since there we have to ensure data integrity, which means taking dependencies into account. That is, collisions will arise that need to be resolved - and there are different methods for resolving them.

An important point in load management is working out a consistent approach to error handling. Errors are classified by severity level. When a critical error occurs, the process must stop, and as early as possible, because its occurrence signals a significant problem that could lead to corrupted data in the warehouse. So load management is not only about starting processes but also about stopping them, as well as preventing untimely (mistaken) starts.
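How such severity-based handling can look inside a control process - a hedged sketch; the severity levels and reactions are an assumed classification, chosen for illustration:

```python
from enum import Enum


class Severity(Enum):
    INFO = 1        # log only
    WARNING = 2     # load the data, notify the support team
    CRITICAL = 3    # stop the control process to protect the target data


def handle(errors: list) -> None:
    for severity, message in errors:
        print(f"[{severity.name}] {message}")
        if severity is Severity.CRITICAL:
            # Stop as early as possible: continuing could corrupt data in the warehouse.
            raise RuntimeError(f"Load stopped: {message}")


handle([(Severity.WARNING, "3 rows failed the phone-format check and were quarantined")])
try:
    handle([(Severity.CRITICAL, "source file is truncated")])
except RuntimeError as e:
    print("control process halted:", e)
```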

For the service layer to operate, a special metadata structure is created. This area stores information about the loading processes, the loaded data sets, the checkpoints used to maintain the increment (up to which point a given process has read), and other service information needed for the system to function.
It is important to note that all target tables in all layers are marked with a special set of meta-fields, one of which is the identifier of the process that updated the given row. For tables inside the warehouse, this process marking gives a uniform way of subsequently extracting the delta of changes. When loading data into the primary data layer the situation is more complicated: the delta extraction algorithm may differ for different loaded objects. In return, the logic of processing the accepted changes and applying them to the target tables is much more complex for the core and the marts than for staging, where everything is quite trivial and it is easy to parameterize and design reusable standard steps (procedures).

I am not trying to cover this topic - organizing the load - in full here; I am just highlighting the points worth paying attention to.
The approach described is just one of the options. It is quite responsive, and its "conceptual prototype" was the Toyota conveyor and the just-in-time system. That is, we move away from the widespread paradigm of an exclusively "nightly data load" and load in small portions during the day, as soon as the data is ready in the various sources: whatever has arrived gets loaded. Many parallel processes run at the same time, and the "hot tail" of fresh data will constantly "flicker" and then even out over time. We must take this feature into account and, if necessary, build custom marts with "slices" where everything is already consistent. That is, it is impossible to achieve both timeliness and consistency (integrity) at the same time; we need a balance - in one place one matters more, in another the other.

It is imperative to provide logging and monitoring facilities. Good practice is to use typed events, for which different parameters can be set and a notification system configured by subscribing to specific events. This matters because when the intervention of a system administrator is required, he should learn about it as early as possible and receive all the necessary diagnostic information. The logs can also be used for post-factum analysis of problems and for investigating incidents of system malfunction, including data quality incidents. A sketch of such typed events follows.
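A sketch of typed events with subscriptions; the event types and the notification channel are assumptions for illustration:

    import logging
    from collections import defaultdict
    from typing import Callable

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("dwh.load")

    _subscribers = defaultdict(list)   # event type -> list of handlers

    def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
        _subscribers[event_type].append(handler)

    def emit(event_type: str, **params) -> None:
        log.info("%s %s", event_type, params)      # every event goes to the log
        for handler in _subscribers[event_type]:   # subscribers are notified
            handler(params)

    # The administrator subscribes only to events that require intervention.
    subscribe("LOAD_FAILED", lambda p: print("notify admin:", p))
    emit("LOAD_STARTED", process="core.client", run_id=42)
    emit("LOAD_FAILED", process="core.client", run_id=42, error="source unavailable")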

Designing and Maintaining Warehouse Data Models

Why is it important to pay attention to the design of data models when developing any system where a database is involved (and especially in a warehouse)? Why not just throw a set of tables anywhere - even in a text editor? Why do we need "these pictures"?
Oddly enough, even experienced developers ask such questions.
Actually, nothing prevents you from sketching out tables and starting to use them - as long as the developer keeps a coherent overall picture of the structure he is building in his head. But what if there are several developers? What if someone else uses these tables? And what if time passes, a person leaves this area, and then returns to it again?

Can you figure it out without a model? In principle, you can: figure it out, sketch the pictures on a piece of paper, and tidy up the data. But it is much easier, clearer and faster to use a ready-made artifact - the data model - and to understand the "logic of its structure"; in other words, it helps to have common rules of the game.

And the most important thing is not even that. The most important thing is that when designing a model we are forced (simply, with no way around it!) to study the subject area more closely and deeply: the specifics of the data and how it is used in various business cases. The questions that we would easily have "pushed aside" as complex or "blurred" while just throwing tables together without trying to design the model precisely, we will be forced to raise and resolve now, during analysis and design, and not later, when we are building reports and trying to "reconcile the irreconcilable" and "reinvent the wheel" every time.

This approach is one of the engineering practices that allow you to create antifragile systems. Such systems are clearly arranged, transparent and convenient to develop, and their "fragility limits" are immediately visible: you can more accurately estimate the "scale of the disaster" when new requirements appear and the time needed for a redesign (if one is needed).
Thus, the data model is one of the main artifacts that must be maintained during system development. Ideally, it should be "on the desk" of every analyst, developer and so on - everyone who participates in system development projects.

Designing data models is a large and separate topic. There are two main approaches to storage design.
For the core, the entity-relationship approach works well: a normalized (3NF) model is built on the basis of studying the subject area, or more precisely its selected part. The same "corporate model" that was discussed above comes into play here.

For designing data marts (showcases), the multidimensional model is suitable. This approach fits the way business users think, because it is a model that is simple and convenient for human perception: people operate with the understandable and familiar concepts of metrics (indicators) and the dimensions by which they are analyzed. This makes the requirements-gathering process simple and clear: we draw a set of "matrices of dimensions and indicators" while talking to representatives of various departments, and then bring them together into a single structure, the "analysis model": we form the "dimension bus" and define the facts that are defined on it. Along the way we work out hierarchies and aggregation rules. A minimal sketch of such an analysis model follows.
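The sketch below expresses the analysis model as plain data structures; the dimensions, facts and measures are invented for illustration:

    # A bus of conformed dimensions and the facts defined on it.
    DIMENSION_BUS = {"date", "product", "customer", "region"}

    FACTS = {
        "sales": {"dimensions": {"date", "product", "customer", "region"},
                  "measures": {"amount", "quantity"}},
        "payments": {"dimensions": {"date", "customer"},
                     "measures": {"paid_amount"}},
    }

    def check_model(bus: set, facts: dict) -> None:
        """Every fact must be defined only on dimensions present in the bus."""
        for name, fact in facts.items():
            unknown = fact["dimensions"] - bus
            if unknown:
                raise ValueError(f"fact {name!r} uses unknown dimensions: {unknown}")

    check_model(DIMENSION_BUS, FACTS)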

From there it is easy to move to the physical model, adding optimization elements that take the peculiarities of the DBMS into account. For Oracle this would be partitioning, a set of indexes, and so on; for Vertica other techniques are used: sorting, segmentation, partitioning.
Special denormalization may also be required: we deliberately introduce redundancy into the data to improve query performance, at the cost of complicating data updates (since the redundancy has to be taken into account and maintained during loading). To improve performance we may also have to create additional aggregate tables or use DBMS-specific features such as projections in Vertica. A sketch of a simple aggregate is given below.
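A sketch of deliberate denormalization through a precomputed aggregate; SQLite stands in for the target DBMS and the table names are hypothetical:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE fact_sales (sale_date TEXT, product_id INTEGER, amount REAL);
        INSERT INTO fact_sales VALUES ('2024-01-01', 1, 10.0), ('2024-01-01', 1, 5.0);
    """)

    # The aggregate duplicates information already present in fact_sales,
    # so it must be rebuilt (or incrementally maintained) on every load.
    con.executescript("""
        CREATE TABLE agg_sales_daily AS
        SELECT sale_date, product_id, SUM(amount) AS total_amount
        FROM fact_sales
        GROUP BY sale_date, product_id;
    """)
    print(con.execute("SELECT * FROM agg_sales_daily").fetchall())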

So, when modeling warehouse data, we actually solve several problems:

  • the task of building a conceptual (logical) model of the core - system and business analysis: researching the subject area, going into the details and nuances of "live" data and its use in the business;
  • the task of building an analysis model, and then a conceptual (logical) model of the data marts;
  • the task of building physical models: managing data redundancy and optimizing for queries and data loading, taking into account the peculiarities of the DBMS.
When developing conceptual models, we may ignore the specifics of the particular DBMS for which the database structure is being designed. Moreover, one conceptual model can be used to create several physical ones - for different DBMSs.

Let's summarize.

  • A data model is not a collection of "pretty pictures", and designing it is not a process of drawing them. The model reflects our understanding of the subject area, and compiling it is a process of studying and researching that area. This is what the time is spent on, and not at all on "drawing and coloring".
  • A data model is a design artifact, a way of exchanging information in a structured form between team members. For this it must be understandable to everyone (which is ensured by notation and explanations) and accessible (published).
  • The data model is not created once and then frozen; it is created and evolves as the system develops. We set the rules of its evolution ourselves, and we can change them if we see how to do it better, simpler and more effectively.
  • The (physical) data model allows you to consolidate and reuse a set of best practices aimed at optimization - i.e. techniques that have already proven themselves for the given DBMS.

Features of data warehouse projects


Let us dwell on the specifics of the projects within which a company builds and develops data warehouses, and look at them from the point of view of the architectural aspect. Why is it important for such projects to build the architecture, and from the very beginning? It is precisely the presence of a well-thought-out architecture that gives a data warehouse project flexibility, allows work to be distributed efficiently between performers, and makes the result easier to predict and the process more predictable.

Data warehouse is custom software

A data warehouse is always a "custom development" rather than a boxed solution. Yes, there are industry-specific BI applications that include a reference data model, pre-configured ETL processes from common sources (for example, ERP systems), and a set of standard BI dashboards and reports. But in practice a warehouse is rarely implemented "as a box". I have been working with warehouses for about 10 years and have never seen such a story. There are always nuances related to the unique features of the company, both in the business and in the IT landscape. Therefore, hoping that the architecture will be provided by the vendor supplying the solution is somewhat reckless. The architecture of such systems often "matures" within the organization itself, or it is formed by the specialists of the contractor that is the main executor on the project.

Data warehouse is an integration project

The data warehouse loads and processes information from many source systems, and to maintain "friendly relations" with them you have to treat them with great care. In particular, you need to minimize the load on the source systems, take their "availability windows" into account, choose interaction interfaces that suit their architecture, and so on. Then the warehouse will be able to pick up data as early as possible and with the required frequency. Otherwise you will be "moved over" to a backup circuit, which is not updated with the most operational frequency.
In addition, it is necessary to take into account the "human factor". Integration is not only about the interaction of machines. It is also communication between people.

Data Warehouse is a Collaborative Project


In a large company, such a system can rarely be done by just one team. As a rule, several teams work here, each of which solves a specific problem.

The architecture should provide the ability to organize their parallel work, while maintaining its integrity and avoiding duplication of the same functionality in different places, by different people. In addition to unnecessary effort, such duplication can lead to data discrepancies later.

In addition, when so many people and teams, often scattered, are involved in the development of the system, the question inevitably arises: how to build communications and information interaction between them. The more standard and understandable approaches and practices are used, the easier, more convenient and efficient it is to organize such work. And, among other things, it is worth thinking about the composition of "working artifacts", among which for data warehouses # 1 are data models (see the previous section).

The data warehouse has a longer life span than other systems

To clarify - the statement is true for a "live", working storage, integrated with key sources, possessing historical data and providing information and analytical services to many divisions of the company.

What grounds do I have for believing so?
Firstly, building a storage is a very resource-intensive process: in addition to the actual costs of equipment, licenses for the necessary technological software and development, almost all systems and divisions of the company are also involved in this. Repeating this entire process from scratch one more time is a very daring idea.

Secondly, if the storage has the correct architecture, then it can quite easily survive the changes of source systems, and the emergence of new requirements from end users, and the growth of data volumes.
If the architecture is correct, information flows are transparent, then such a system can be developed for a long time without the risk of being stuck in a situation when making changes due to difficulties in assessing the impact.

Gradual iterative development

The last thing a Customer wants, when getting involved in a warehouse project, is to freeze their requirements for a year or two until a complete corporate data model is designed, all sources are fully connected, and so on.

In the eyes of Customers, the data warehouse often looks like an absolute monster - the tasks, goals and development horizon of the system are so voluminous. And often the customer is afraid that "at the expense of his budget" the IT department will solve some "their problems." And again we are faced with the issue of interaction between people and the ability to calmly state our position and negotiate.

Competent architectural approaches allow you to develop the system iteratively, increasing the functionality gradually, without going into "development" for several years before starting to give a result.

Although it should be noted that "miracles do not happen": the start-up also takes time. For warehouses it can be quite long, since these are large volumes of data, including historical data from old periods, when the rules for processing information may have differed from the current ones. Therefore, enough time is needed for analytical work, interaction with source systems and a series of "trials and errors", including load tests on real data.

Data warehouses - "multi-project story"

It is difficult to single out a single business customer for a data warehouse. And it is believed (not without reason) that the key factor in the success of the project of building a storage facility is the support of the company's management - directly the first person.
A repository is rarely built and developed as part of a single project. Typically, there are different needs for data consolidation and analytics, behind them are different customers and user groups. Therefore, the repository is often developed within the framework of several parallel projects.

Balance of innovation and proven solutions

The topic of data warehousing is quite "ancient" (if such a word applies to an industry as young as IT) and rather conservative. Nevertheless, progress does not stand still: the limitations that previously existed because of expensive and slow disks, expensive memory and so on have now been removed. At the same time, it is time to revisit some architectural approaches. Moreover, this applies both to the technological platforms and to the architecture of the applied systems built on them.

It is important to strike a balance here and maintain a fairly "green" (economical) approach both to resources and to the stored information. Otherwise you can very quickly turn the warehouse into a semi-structured "dump" which, if it can be sorted out at all, can be sorted out only with considerable effort.
Yes, we have more possibilities now, but that does not mean we should reject all the accumulated, time-tested practices, for which it is clear how and why to use them, and rush off led only by the foggy ghost of "innovation".
Keeping a balance means using new methods and approaches where they open up new opportunities, but at the same time using old proven ones - to solve urgent problems that have not been canceled.
What can we do as developers and designers of application solutions? First of all, to know and understand the technological changes of the platforms on which we work, their capabilities, features and application boundaries.

Let's look at the DBMS as the most critical and important technological platform for storage.
Recently, there has been a clear drift of relational databases, created initially as "universal", towards specialization. For a long time, leading vendors have been releasing various options - for applications of different classes (OLTP, DSS & DWH). In addition, additional opportunities appear for working with text, geo-data, etc.

But this was not the end of it - products began to appear that were initially focused on a certain class of tasks, i.e. specialized DBMS. They may or may not use the relational model. It is important that they are initially "sharpened" not just for storing and processing "business information" in general, but for specific tasks.

Apparently, centralization and specialization are two complementary trends that periodically replace each other, ensuring development and balance, much like evolutionary (gradual) development and cardinal change. For example, in the 1990s Michael Stonebraker was one of the authors of the Third-Generation Database System Manifesto, which clearly expressed the idea that the world did not need yet another revolution in the database world. However, about ten years later he published works announcing the prerequisites for a new era in the DBMS world, based on specialization.
He emphasizes that common universal DBMSs are built on a "one size fits all" architecture that takes into account neither the changes in hardware platforms nor the division of applications into classes for which a more optimal solution can be found than implementing universal requirements.
In line with this idea he started a number of projects. One of them, C-Store, is a columnar DBMS with a shared-nothing (SN) architecture, created specifically for data-warehouse-class systems; this product was then marketed as HP Vertica.

It seems that now the topic of the development of data warehouses has slipped into a new stage of development. New technologies, approaches and tools appear. Their study, testing and intelligent application allows us to create really interesting and useful solutions. And bring them to implementation, enjoying the fact that your developments are used in real work and are useful.

Epilogue

In preparing this article, I tried to focus primarily on architects, analysts and developers who work directly with data warehouses. But it turned out that the article inevitably took the topic somewhat wider, and other categories of readers came into view. Some points will seem controversial, some unclear, some obvious. People are different, with different experience, backgrounds and positions.
For example, typical managerial questions such as "when to hire architects?" and "when to do architecture?" sound rather strange to us (developers, designers), because for us the architecture of a system appears with its birth, whether we are aware of it or not. And even if there is no formal architect role in a project, a normal developer always "switches on his own inner architect".

By and large, it doesn't matter who exactly performs the role of the architect - it is important that someone asks similar questions and investigates the answers. If the architect is clearly singled out, this only means that he is primarily responsible for the system and its development.
Why did I find the topic of "antifragility" relevant to this subject?

"The uniqueness of antifragility is that it allows us to work with the unknown, to do something in conditions when we do not understand what exactly we are doing, and to achieve success."/ Nassim N. Talb /
Therefore, the crisis and a high degree of uncertainty are not an excuse in favor of the absence of architecture, but factors that reinforce its need.


5.1. Organization of data in corporate information systems.

At the most simplified level, a corporate information system (CIS) contains a corporate computer network and a specialized application software package for solving problems in the subject area. Both the application package and the network rely on data about the state and development of the systems they monitor and control. Historically, a CIS is made up of separate, branched subsystems of individual enterprises, interconnected and often forming a hierarchical system. It is natural to assume that such subsystems have both their own data sources and their own storage locations. When they are combined into a single system, questions arise about the correct joint use of data physically located in different places. Consequently, successful management of a production association equipped with a CIS requires a reliable system for collecting, storing and processing data - in other words, a unified information infrastructure that supports strategic BI (Business Intelligence) projects, i.e. an integrated database for storing and using data. The main goal of data integration is to obtain a unified, consistent picture of the state of corporate business data. Integration itself is a complex process, within which it is advisable to single out methods, technologies, products and applications:

Methods are approaches to data integration.

Technologies are processes that implement particular methods of data integration.

Products are commercial solutions that support one or another data integration technology.

Applications are ready-made technical solutions supplied by developers in accordance with the wishes of their customers.

Depending on the complexity of corporate information systems and the tasks they are designed to solve, the organization of data in them differs somewhat. In a CIS designed to ensure effective management of the business processes of individual branches and of the corporation as a whole, it is customary to speak of corporate databases. In corporate information systems used at the highest levels of management, mostly associated with operational analysis and decision making, with planning, design and forecasting of management activities, the terminology of data warehouses is used. It is worth noting that the phrase "integrated storage" applies to both.

5.2. Corporate databases and requirements for them

As a system-wide integrated data store, the corporate database is designed to provide information for effective management of all business processes and divisions of the corporation. Data integration involves creating a new structure that organically includes data from the databases of the separate divisions, so such a structure must meet certain requirements:

· simple and user-friendly data entry into the database;
· data storage that does not lead to excessive data growth;
· availability of shared information to employees of all divisions of the corporation, with mandatory differentiation of access rights;
· fast search and retrieval of the required information;
· sorting and filtering of the required data;
· grouping of data of the same kind;
· intermediate and final calculations over fields;
· transformation and clarity of the output data;
· scalability;
· protection against accidental failures, irrecoverable data loss and unauthorized access.

In addition, when integrating isolated (distributed) databases into a single corporate database, it is important that the user can work with it as if it were not distributed.

The creation of an integrated corporate database is possible by various methods, the main of which are:
· consolidation,
· federation,
· propagation (dissemination).

5.3. Characteristics of corporate database integration solutions

Consolidation. Consolidation usually means bringing together data of the same kind. The term is widely used in banking, where an annual consolidated balance sheet is formed, presenting all the assets and liabilities of the parent bank together with its branches.

As applied to a corporation, with this method data is copied and collected from the primary databases (slave DBs) into a single storage location (master DB). As a rule, the server of the central (head) office is chosen as this storage location (Figure 5.1).

Figure 5.1. Data Consolidation Method

Data in the database - Master is used for reporting, analysis, development and decision-making, as well as a source of data for other branches of the corporation.

The most common technologies supporting consolidation are:

· extraction, transformation and loading - ETL (Extract, Transform, Load);

· enterprise content management - ECM (Enterprise Content Management).

The advantages of the consolidation method are:

1. The ability to transform (restructure, reconcile, cleanse and/or aggregate) significant amounts of data as they are transferred from the primary systems to the final storage location, using ETL technology.

2. The ability to manage unstructured data such as documents, reports and pages, thanks to ECM technology solutions.

To work with the consolidated CIS database, special business applications are used that allow you to create queries against the database, build reports and carry out data analysis on their basis.

The disadvantage of integration through consolidation is that the consolidated data in the integrated storage location cannot be updated synchronously with updates in the primary systems, because of conflicts that arise during synchronization.

There is a time lag between the moment data is updated in the primary systems and the moment it is updated in the final storage location; this lag can range from a few seconds to several hours or even days. A minimal sketch of the consolidation flow is given below.
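The sketch below shows the consolidation flow in Python, with SQLite standing in for the slave and master DBMSs; the schemas and the reconciliation step are assumptions for illustration:

    import sqlite3

    def extract(slave):
        return slave.execute("SELECT client_id, name, city FROM clients").fetchall()

    def transform(rows):
        # Example reconciliation step: normalize the spelling of the city.
        return [(cid, name, city.strip().upper()) for cid, name, city in rows]

    def load(master, branch, rows):
        master.executemany(
            "INSERT INTO clients_master (branch, client_id, name, city) VALUES (?, ?, ?, ?)",
            [(branch, *row) for row in rows],
        )

    master = sqlite3.connect(":memory:")
    master.execute("CREATE TABLE clients_master (branch TEXT, client_id INTEGER, name TEXT, city TEXT)")

    slave = sqlite3.connect(":memory:")
    slave.execute("CREATE TABLE clients (client_id INTEGER, name TEXT, city TEXT)")
    slave.execute("INSERT INTO clients VALUES (1, 'Ivanov', ' moscow ')")

    load(master, "branch_A", transform(extract(slave)))

The data in the master copy lags behind the primary systems by exactly the period with which such a job is scheduled.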

Federation. Federation usually means a union; a similar term is often used in politics for the structure of states (for example, Germany, Russia, the USA).

Data federation in a corporate database means creating a virtual picture that combines several primary databases into a single virtual whole (see Figure 5.2). Data federation itself consists of extracting data from the primary systems on demand, based on external requests. The work of a corporate database integrated by the federation method is managed by the federation processor.

Figure 5.2. Data federation method

When accessing data in the virtual database, a business application sends a request against the virtual picture. The federation processor extracts data from the corresponding primary systems, integrates it in accordance with the virtual picture and returns the result to the business application that issued the request. All necessary data transformations are carried out as the data is extracted from the primary systems.

The federated approach to data integration is supported by Enterprise Information Integration (EII) technology.

A feature of the federated solution is that the federation processor uses metadata (knowledge), which includes data on the composition and characteristics of the virtual picture, data volumes, semantic relationships between the data and ways of accessing them; this helps the federated solution optimize access to the primary systems. A minimal sketch of such a processor follows.
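In this sketch no copy of the data is kept, and each query against the virtual picture is decomposed into on-the-fly queries to the primary systems; the source registry and field mapping play the role of metadata, and all names are invented:

    from typing import Callable, Dict

    # Metadata: which primary system answers for which part of the virtual picture.
    SOURCES: Dict[str, Callable[[int], dict]] = {
        "crm": lambda cid: {"name": f"client {cid}"},      # stands in for a CRM query
        "billing": lambda cid: {"balance": 100.0 * cid},   # stands in for a billing query
    }

    def virtual_client(client_id: int) -> dict:
        """Assemble one virtual record on demand from all primary systems."""
        record = {"client_id": client_id}
        for source_name, fetch in SOURCES.items():
            record.update(fetch(client_id))   # transformations would happen here
        return record

    print(virtual_client(7))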

The main advantages of the federated approach are:

· access to current data without creating an additional database;

· usefulness after the acquisition or merger of companies;

· indispensability in cases where, for security or licensing reasons, copying data from the primary systems is restricted;

· suitability when the high autonomy of local divisions of the corporation must be combined with flexible centralized control of their activities;

· a high degree of usefulness for large transnational corporations.

The disadvantages of this approach include:

· reduced performance due to the additional cost of accessing multiple data sources;

· federation is most appropriate for retrieving small amounts of data;

· high requirements for the quality of the primary data.

Propagation (dissemination). Dissemination usually refers to moving replicated objects to other locations. Data propagation means replicating the primary databases and moving the data from one location to another. With this method, business applications operate online and move data to the destination systems based on the events that occur. An important issue for such a solution is data updating, which can be performed in synchronous or asynchronous mode. Synchronous mode assumes that updates in both the primary system and the target system occur within the same physical transaction.

Examples of technologies that support the data propagation method are:

· enterprise application integration - EAI (Enterprise Application Integration);

· enterprise data replication - EDR (Enterprise Data Replication).

The generalized structure of the implementation of the data dissemination method is as shown in Figure 5.3.

Figure 5.3. Data dissemination method

A distinctive feature of the data distribution method is the guaranteed delivery of data to the destination system with a minimum delay close to real time.
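A sketch of the propagation idea: changes in the primary system are pushed to the target either synchronously or through an asynchronous delivery queue. The "systems" here are plain dictionaries, purely for illustration:

    import queue

    primary = {}
    target = {}
    outbox = queue.Queue()        # asynchronous mode: a delivery queue

    def update_client(client_id, data, synchronous=False):
        primary[client_id] = data
        if synchronous:
            target[client_id] = data        # same "transaction": applied immediately
        else:
            outbox.put((client_id, data))   # delivered later by a replication worker

    def replication_worker():
        while not outbox.empty():
            client_id, data = outbox.get()
            target[client_id] = data

    update_client(1, {"name": "Ivanov"}, synchronous=True)
    update_client(2, {"name": "Petrov"})    # asynchronous
    replication_worker()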

The combination of integration (EAI) and replication (EDR) technologies provides the following advantages:

· high performance;

· the ability to restructure and cleanse data;

· load balancing through the creation of backups and data recovery.

Hybrid approach. The realities of economic activity are such that no two enterprises, let alone two corporations, are identical. This leaves its mark on the process of creating and filling a corporate information system and applies fully to the methods of data integration. For this reason, many CIS use a so-called hybrid approach, which combines several integration methods at once. Examples are technologies that provide a consistent picture of customer information:

· customer data integration in CDI systems (Customer Data Integration);

· customer data integration in CRM modules (Customer Relationship Management).

In particular, a CDI approach can be implemented in several ways.

The simplest is to create a consolidated customer database containing data from the primary systems. The information lag can be regulated by using different consolidation modes, operational or batch, depending on how often this information is updated.

The second way is data federation, when a virtual business view of the customer data contained in the primary systems is built; the metadata can contain shared key elements that are used to relate customer information.

Thus, general customer data (for example, contact details) can be consolidated as the most static data, while more dynamic data (such as order information) can be federated.

Moreover, the hybrid approach can be extended using the data dissemination method. For example, a client using the services of an Internet store, during the service, changes his details. These changes can be sent to the consolidated part of the database, and from there propagated to all primary systems containing data about the store's customers.

Keeping in mind the advantages and disadvantages of each of the methods, it is advisable to creatively approach their application and joint use.

For example, data federation is useful when the costs of consolidating data outweigh the business benefits that consolidation provides. In particular, on-line processing of requests and preparation of reports is exactly such a situation.

Practical applications of the data dissemination method are very diverse, both in terms of performance and in terms of the ability to restructure and cleanse data.

5.4. The concept and structural solutions of data warehouses

A data warehouse is a subject-oriented, integrated store of information that accumulates external and operational data, as well as data from other systems, on the basis of which decision-making and data analysis processes are built.

Unlike databases and data banks, data warehouses are based not on internal but on external data sources: various information systems, electronic archives, public electronic catalogs, reference books and collections.

The data warehouse concept is based on two main ideas:

1. Integration of disaggregated detailed data (describing specific facts, properties, events, etc.) in a single repository.

2. Separation of datasets and applications used for processing and analysis.

The data warehouse is organized in cases where it is necessary to obtain:

Integration of current and historical data values,

Combining data from disparate sources,

Creation of a reliable data platform for analytical purposes,

Ensuring data consistency throughout the organization,

Facilitating the implementation of corporate data standards without changing existing operating systems,

· Providing a broad historical picture and opportunities for analyzing development trends.

Historically, data warehouses have been built on a one-, two-, and three-tier scheme.

Single-tier schemes were originally intended for the simplest architectures, including functional DSS with an underdeveloped information infrastructure, where analysis is carried out directly on data from the operational systems, according to the principle: data - presentation forms.

The advantages of such schemes are:

Fast data transfer from operational systems to a specialized system without intermediate links,

· Minimum costs due to the use of a single platform.

Disadvantages:

A narrow range of issues to be resolved due to a single data source,

· Poor data quality due to lack of cleaning step.

Two-tier schemes provide a chain: data - data marts - presentation forms. They are used in corporations with a large number of independent divisions that use their own information technologies.

Advantages:

The data marts used are designed to answer a specific set of questions,

· It is possible to optimize the data in the data marts to improve performance.

Disadvantages:

Difficulty in ensuring data consistency due to the repeated duplication of data in the marts,

Potential complexity of filling data marts when there are many data sources,

· Due to the lack of data consolidation at the corporate level, there is no single picture of the business.

As development evolved, full-fledged data warehouses for modern corporate systems came to be built on a three-tier architecture (see Figure 5.4).

The first tier contains the various systems of record that serve as data sources: enterprise resource planning (ERP) systems, reference (operational) systems, external sources or systems supplying data from information agencies, and so on.

The second tier contains the central warehouse, into which data from all first-tier sources is collected, and an operational data store, which performs two functions:

· it serves as a source of analytical information used for operational management;

· it prepares data for subsequent loading into the central warehouse, i.e. performs the checks and transformations required by the different schedules on which data arrives from the first tier.

The third tier is a collection of subject-oriented data marts.

Data marts are relatively small, functionally oriented stores whose content supports the analytical tasks of individual divisions of the corporation. In effect, data marts are subsets of the warehouse data. At the same time, end users can access the detailed data of the warehouse when the mart does not contain enough data, and can obtain a more complete picture of the state of the business.

Figure 5.4. Data warehouse architecture

The main technological operations of data warehouses organized in this way are:

· data extraction: transferring data from heterogeneous sources into the operational data store;

· data transformation: modifying data according to special rules and transferring it to the central warehouse;

· data cleansing: eliminating duplication of data coming from different sources;

· data updating: propagating updates to the base tables and to the derived data hosted in the warehouse.

Advantages:

· Filling storefronts is simplified due to the use of a single source of cleansed data,

Data marts are synchronized with the corporate business picture, making it easy to expand the central repository and add data marts,

· Guaranteed performance.

Disadvantages:

The presence of data redundancy, leading to an increase in requirements for data storage technology,

5. 5. Database management systems and technologies for accessing data in CIS

Database Management System(DBMS) is a set of language and software tools designed to create, maintain and share a database by one or many users.

Currently, the most widespread DBMSs are built on the relational data model, which is described by the rigorous mathematical apparatus of relational theory.

A feature of the DBMS working in the corporate information system is the fact that they have to manage databases located on media distributed in space.

To avoid unnecessary duplication or copying of data in a CIS, the main emphasis is placed on remote data processing. CIS databases contain data needed by many users, and simultaneous access by several users is possible when the local network runs a DBMS that works with the users and with a single database.

The main technological solutions for multi-user work with databases are file/server and client/server technologies. Taking the more suitable of these, client/server, a CIS organizes specialized systems for processing distributed databases. In this case distributed databases are managed so that the data is distributed at the physical rather than the logical level, while the database itself is treated as a single "super-schema". Administrative functions are divided between the integrated database administrator and the local database administrators. The integrated database administrator controls the differentiation of access of different users, ensures the integrity and safety of the data, and protects the data from simultaneous correction by several users. Access control follows the rights granted to individual users in the network operating system.

A characteristic feature of programs created with a DBMS for working with remote and distributed corporate databases is the use of the open database connectivity interface, ODBC (Open Database Connectivity). All data transfer functions are assigned to the ODBC interface, which acts as a bridge between the DBMS of the integrated database and the DBMS of the client application. The client's DBMS can thus interact not only with its local databases, but also with data located in the integrated database: the client can send queries to the integrated database DBMS, receive data and send back its own updates. A minimal sketch of such access is given below.
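The sketch requires the pyodbc package and a configured data source; the DSN name, credentials and table are assumptions for illustration:

    import pyodbc

    # The connection string refers to a hypothetical data source name.
    conn = pyodbc.connect("DSN=corporate_dwh;UID=report_user;PWD=secret")
    cursor = conn.cursor()

    # The client sends a query to the integrated database and fetches the result.
    cursor.execute("SELECT client_id, name FROM clients WHERE city = ?", "Moscow")
    for client_id, name in cursor.fetchall():
        print(client_id, name)

    conn.close()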

Industry data models

The main purpose of models is to facilitate orientation in the data space and to help highlight the details that are important for business development. In today's environment, a successful business requires a clear understanding of the links between the various components and a good idea of the overall picture of the organization. Identifying all the details and relationships using models allows the most efficient use of the time and tools for organizing the company's work.

Data models are abstract models that describe how data is presented and accessed. Data models define data items and the relationships between them in a particular area. A data model is a navigation tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real-world information. This allows for better communication within the organization and thus creates a more flexible and stable application environment.

The data model uniquely defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as, for example, an image, binary file, or text, where the meaning can be ambiguous).

As a rule, a distinction is made between higher-level models (more general in content) and lower-level ones (correspondingly, more detailed). The upper level of modeling is the so-called conceptual data model, which gives the most general picture of the functioning of an enterprise or organization. The conceptual model includes the main concepts or subject areas that are critical to the functioning of the organization; usually their number does not exceed 12-15. Such a model describes the classes of entities that are important to the organization (business objects), their characteristics (attributes), and the associations between pairs of these classes (that is, relationships). Since the terminology of business modeling has not yet fully settled, in English-language sources conceptual data models may also be called domain models or subject enterprise data models.

The next hierarchical level is logical data models. They may also be called enterprise data models or business models. These models contain data structures, their attributes and business rules, and represent the information used by the enterprise from a business perspective. In such a model, data is organized in the form of entities and the relationships between them. The logical model presents data in a way that is easy for business users to understand. A logical model may include a data dictionary - a list of all entities with their precise definitions, which allows different categories of users to have a common understanding of all the input and output information streams of the model. The next, lower level of modeling is the physical implementation of the logical model using specific software and hardware platforms.

The logical model contains a detailed corporate business solution, which usually takes the form of a normalized model. Normalization is a process that ensures that each data item in the model has only one value and depends completely and uniquely on the primary key. Data items are organized into groups according to their unique identification. The business rules governing the data items must be fully incorporated into the normalized model, with prior verification and validation. For example, a data item such as Customer Name is likely to be split into First Name and Last Name and grouped with other related data items into a Customer entity with the primary key Customer ID, as in the sketch below.
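A small sketch of that normalization example; the attribute names are illustrative:

    from dataclasses import dataclass

    @dataclass
    class Customer:
        customer_id: int      # primary key: every attribute depends only on it
        first_name: str
        last_name: str

    # Before normalization a single field mixed two facts: "Ivan Petrov".
    raw_name = "Ivan Petrov"
    first, last = raw_name.split(" ", 1)
    print(Customer(customer_id=1, first_name=first, last_name=last))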

The logical data model is independent of application technologies such as databases, networking technologies, or reporting tools, and the means of their physical implementation. There can be only one Enterprise Data Model in an organization. Logical models typically include thousands of entities, relationships, and attributes. For example, a data model for a financial institution or telecommunications company can contain about 3000 industry concepts.

It is important to distinguish between logical and semantic data model. The logical data model represents an enterprise business solution, and the semantic data model represents an applied business solution. The same corporate logical data model can be implemented using different semantic models, i.e. semantic models can be viewed as the next level of modeling approaching physical models. Moreover, each of these models will represent a separate "slice" of the corporate data model in accordance with the requirements of various applications. For example, in the corporate logical data model, the Client entity will be completely normalized, and in the semantic model for the data mart, it can be represented as a multidimensional structure.

A company has two ways to create a corporate logical data model: build it independently or use a ready-made industry model (industry logical data model). In this case the difference in terms reflects only different approaches to building the same logical model. If a company independently develops and implements its own logical data model, such a model is usually called simply a corporate logical model. If an organization decides to use a ready-made product from a professional supplier, then we can speak of an industry logical data model. The latter is a ready-made logical data model that reflects the functioning of a particular industry with a high degree of accuracy. An industry logical model is a domain-specific and integrated view of all the information that must reside in an enterprise data warehouse to answer both strategic and tactical business questions. Like any logical data model, the industry model is independent of application decisions. It also does not include derived data or other calculations for faster data retrieval. As a rule, most of the logical structures of such a model are well embodied in an efficient physical implementation. Such models are developed by many suppliers for a wide variety of fields: finance, manufacturing, tourism, healthcare, insurance, etc.

An industry logical data model contains information that is common to the industry and therefore cannot be a comprehensive solution for a company. Most companies have to grow the model by an average of 25% by adding data items and expanding definitions. Out-of-the-box models contain only key data elements, and the rest of the elements must be added to the corresponding business objects during the installation of the model in the company.

Industry logical data models contain a significant amount of abstraction. Abstractions mean the union of similar concepts under common names such as Event or Participant. This adds flexibility and uniformity to industry models. Thus, the concept of an Event is applicable to all industries.

Business Intelligence Specialist Steve Hoberman identifies five factors to consider when deciding whether to acquire an industry data model. The first is the time and money needed to build the model. If an organization needs to achieve results quickly, then the industry model will be beneficial. Using an industry model may not immediately provide a picture of the entire organization, but it can save a significant amount of time. Instead of modeling itself, time will be spent linking existing structures to the industry model and discussing how best to customize it to the needs of the organization (for example, which definitions should be changed and which data items should be added).

The second factor is the time and money required to keep the model in good working order. If the enterprise data model is not part of a methodology that allows you to monitor compliance with its accuracy and compliance with modern standards, then such a model becomes outdated very quickly. The industry data model can prevent this risk from happening as it is kept up to date with external resources. Of course, changes taking place within the organization should be reflected in the model by the company itself, but industry changes will be reproduced in the model by its supplier.

The third factor is experience in risk assessment and modeling. The creation of a corporate data model requires qualified resources from both the business and the IT staff. As a rule, managers are well aware of either the work of the organization as a whole, or the activities of a particular department. Few of them have both broad (company-wide) and deep (within departments) knowledge of their business. Most managers usually know only one area well. Therefore, in order to get the general corporate picture, significant business resources are required. This also increases the demands on the IT staff. The more business resources are required to create and test a model, the more experienced analysts must be. They must not only know how to get information from the business staff, but also be able to find a common point of view in contentious areas and be able to present all this information in an integrated way. The person creating the model (in many cases the same analyst) must have good modeling skills. Building enterprise logic models requires modeling “for the future” and the ability to literally convert complex business “into squares and lines”.

On the other hand, the industry model allows outside expertise to be leveraged. Industry-specific logic models are built using proven modeling methodologies and teams of experienced professionals to avoid common and costly problems that can arise when developing enterprise data models within an organization.

The fourth factor is the existing application infrastructure and supplier relationships. If an organization already uses many tools from one supplier and has an established relationship with it, then it makes sense to order the industry model from that supplier as well. Such a model will be able to work smoothly with other products from the same supplier.

The fifth factor is intra-industry exchange of information. If a company needs to communicate with other organizations working in the same field, then the industry model can be very useful in this situation. Organizations within the same industry use similar structural components and terminology. Nowadays, in most industries, companies are forced to exchange data in order to successfully conduct business.

The most effective are the industry models offered by professional suppliers. High efficiency of their use is achieved due to the significant level of detail and accuracy of these models. They usually contain many data attributes. In addition, the creators of these models not only have extensive modeling experience, but are also well versed in building models for a particular industry.

Industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of organizations surveyed found integration to be a significant barrier to new applications adoption. On the contrary, the implementation of data integration generates tangible income for the company.

The industry data model, in addition to linking to existing systems, provides great benefits for enterprise-wide projects such as Enterprise Resource Planning (ERP), master data management, business intelligence, data quality improvement, and employee development.

Thus, industry logical data models are an effective tool for integrating data and obtaining a holistic view of the business. The use of logical models seems to be a necessary step towards the creation of corporate data warehouses.

Publications

  1. Steve Hoberman. Leveraging the Industry Logical Data Model as Your Enterprise Data Model.
  2. Claudia Imhoff. Fast-Tracking Data Warehousing & Business Intelligence Projects via Intelligent Data Modeling.


TOPIC V. CORPORATE DATABASES

V.1. Organization of data in corporate systems. Corporate databases.

V.2. DBMS and structural solutions in corporate systems.

V.3. Internet / Intranet technologies and corporate solutions for database access.

V.1. ORGANIZATION OF DATA IN CORPORATE SYSTEMS. CORPORATE DATABASES

The corporate database is the central link of the corporate information system and allows you to create a single information space for the corporation (Figure 1.1).

There are various definitions of databases.

A database (DB) is understood as a set of information logically connected in such a way as to make up a single set of data stored in the memory devices of a computer. This set serves as the initial data for the tasks solved during the operation of automated control systems, data processing systems, and information and computing systems.

The term database can also be summarized as a collection of logically related data intended for shared use.

A database is also understood as a set of data stored together with such minimal redundancy that the data can be used optimally for one or more applications.

The purpose of creating databases as a form of data storage is to build a data system that does not depend on the adopted algorithms (software), the technical means used, or the physical location of the data in the computer. The database assumes multipurpose use (several users, many forms of documents and queries of one user).

Basic requirements for databases:

  • Completeness of data presentation. The data in the database must adequately represent all information about the object and be sufficient for the data processing tasks.
  • Database integrity. The data must be preserved during processing and in any situations that arise in the course of work.
  • Data structure flexibility. The database should allow data structures to be changed without violating its integrity and completeness when external conditions change.
  • Feasibility. There must be an objective representation of the various objects, their properties and relationships.
  • Availability. Differentiation of access to data must be provided.
  • Minimal redundancy. The database should have minimal redundancy in representing data about any object.

Knowledge means a set of facts, patterns and heuristic rules that can be used to solve the problem.

A knowledge base (KB) is a set of databases and the rules used with them, obtained from decision makers. The knowledge base is an element of expert systems.

There are different ways of representing data.

Physical data is data as it is stored in computer memory.

The logical representation of data corresponds to the user's view of the physical data. The difference between the physical and the corresponding logical representation is that the latter reflects some important relationships between the physical data.

A corporate database is understood as a database that unites, in one form or another, all the necessary data and knowledge about the organization being automated. In corporate information systems the concept of integrated databases has taken hold, in which the principle of single entry and repeated use of information is implemented.

Figure 1.1. The structure of the interaction of departments with the information resources of the corporation.

Corporate databases can be concentrated (centralized) or distributed.

A concentrated (centralized) database is a database whose data is physically stored on the storage devices of one computer. Figure 1.2 shows a diagram of a server application for accessing databases on various platforms.

Figure 1.2. Scheme of a heterogeneous centralized database

Centralizing information processing made it possible to eliminate such shortcomings of traditional file systems as incoherence, inconsistency and redundancy of data. However, as databases grow, and especially when they are used in geographically dispersed organizations, problems arise. For example, for a concentrated database located at a node of a telecommunications network, through which various departments of an organization access the data, the following difficulties arise as the volume of information and the number of transactions grow:

  • Large flow of data exchange;
  • High traffic on the network;
  • Low reliability;
  • Poor overall performance.

Although it is easier to ensure security, integrity, and consistency of information during updates in a concentrated database, the difficulties listed above remain. Data decentralization is proposed as a possible solution to these problems. Decentralization achieves:

  • Higher degree of simultaneity of processing due to load balancing;
  • Better use of local data when performing remote queries;
  • Lower costs;
  • Ease of managing local databases.

The costs of creating a network in the nodes of which workstations (small computers) are located are much lower than the costs of creating a similar system using a large computer. Figure 1.3 shows the logical diagram of a distributed database.

Fig. 1.3. Distributed database of a corporation.

Let's give the following definition of a distributed database.

A distributed database is a collection of information, files (relations), stored in different nodes of an information network and logically connected in such a way as to make up a single set of data (the connection can be functional or through copies of the same file). Thus, it is a set of databases that are logically interconnected but physically located on several machines that are part of the same computer network.

The most important performance requirements for a distributed database are:

  • Scalability;
  • Compatibility;
  • Support for various data models;
  • Portability;
  • Location transparency;
  • Autonomy of nodes of a distributed database (Site Autonomy);
  • Distributed request processing;
  • Execution of distributed transactions.
  • Support for a homogeneous security system.

Location transparency allows users to interact with databases without knowing anything about their location. The autonomy of distributed database nodes means that each database can be maintained independently of the others. A distributed query is a query (SQL statement) during whose execution objects (tables or views) of different databases are accessed. When distributed transactions are executed, concurrency control of all the databases involved is performed. Oracle7 uses two-phase commit technology to perform distributed transactions.

The databases that make up a distributed database do not have to be homogeneous (i.e. maintained by the same DBMS) or processed in the environment of the same operating system and/or on computers of the same type. For example, one database can be an Oracle database on a SUN machine running SUN OS (UNIX), a second can be a DB2 database on an IBM 3090 mainframe with the MVS operating system, and a third can be maintained by SQL/DS, also on an IBM mainframe but with the VM operating system. Only one condition is required: all machines with databases must be accessible over the network of which they are part.

The main task of a distributed database is to distribute data over the network and provide access to it. There are the following ways to solve this problem (a sketch follows the list):

  • Each node stores and uses its own dataset, which is available for remote queries. This distribution is called partitioned (divided).
  • Some data frequently used at remote sites may be duplicated. This distribution is called partially duplicated.
  • All data is duplicated at each node. This distribution is called fully duplicated.
  • Some files can be split horizontally (a subset of records is selected) or vertically (a subset of attribute fields is selected), while the selected subsets are stored in different nodes along with unsplit data. This distribution is called split (fragmented).
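As a hedged illustration (a minimal Python sketch with invented table and column names, not tied to any particular DBMS), the following shows horizontal and vertical fragmentation of a hypothetical employees relation across nodes, and how vertical fragments are rejoined by the shared key:

    # A minimal sketch of data fragmentation; the "employees" relation is invented.
    employees = [
        {"emp_id": 1, "name": "Ivanov", "region": "North", "salary": 1200},
        {"emp_id": 2, "name": "Petrov", "region": "South", "salary": 1500},
        {"emp_id": 3, "name": "Sidorov", "region": "North", "salary": 1100},
    ]

    # Horizontal fragmentation: each node stores a subset of the records.
    node_north = [row for row in employees if row["region"] == "North"]
    node_south = [row for row in employees if row["region"] == "South"]

    # Vertical fragmentation: each node stores a subset of the attributes,
    # always together with the key so the fragments can be rejoined.
    node_hr = [{"emp_id": r["emp_id"], "name": r["name"]} for r in employees]
    node_payroll = [{"emp_id": r["emp_id"], "salary": r["salary"]} for r in employees]

    # Rejoining the vertical fragments by the shared key reconstructs the relation.
    by_id = {r["emp_id"]: dict(r) for r in node_hr}
    for r in node_payroll:
        by_id[r["emp_id"]].update(r)
    print(sorted(by_id.values(), key=lambda r: r["emp_id"]))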

When creating a distributed database, at the conceptual level, you have to solve the following tasks:

  • It is necessary to have a single conceptual schema of the entire network. This provides logical data transparency for the user, who can then form a query to the entire database from a single terminal (as if working with a centralized database).
  • A schema is needed to locate the data on the network. This will provide transparency of data placement, thanks to which the user does not have to specify where to send the request to get the required data.
  • It is necessary to solve the problem of heterogeneity of distributed databases. Distributed databases can be homogeneous or heterogeneous in terms of hardware and software. The problem of heterogeneity is relatively easy to solve if the distributed database is heterogeneous in the sense of hardware but homogeneous in the sense of software (the same DBMS in all nodes). If different DBMSs are used in the nodes of a distributed system, means of transforming data structures and languages are required. This should provide transparency of transformation across the nodes of the distributed database.
  • It is necessary to solve the problem of dictionary management. To provide all kinds of transparency in a distributed database, you need programs that manage numerous dictionaries and reference books.
  • You need to define methods for executing queries in a distributed database. The methods for executing queries in a distributed database differ from those in centralized databases, since individual parts of the queries need to be executed at the location of the relevant data and the partial results must be passed to other nodes; at the same time, coordination of all processes must be ensured.
  • It is necessary to solve the problem of parallel query execution. A distributed database requires a sophisticated concurrency control mechanism, which, in particular, must ensure synchronization when information is updated, which ensures data consistency.
  • A well-developed methodology for the distribution and placement of data, including splitting, is required; this is one of the main requirements for a distributed database.

One of the actively developing new areas of computing-system architecture, providing a powerful tool for non-numerical information processing, is the database machine. Database machines are used to solve non-numerical tasks such as storing, searching and transforming documents and facts, and working with objects. Following the definition of data as digital and graphic information about objects of the surrounding world, different content is put into the concept of data in numerical and non-numerical processing. Numerical processing uses objects such as variables, vectors, matrices, multidimensional arrays, constants, and so on, while non-numerical processing uses objects such as files, records, fields, hierarchies, networks, relations, etc. Non-numerical processing is interested directly in information about objects (for example, a specific employee or a group of employees), and not in the employee file as such. The employee file is not indexed here in order to select a specific person; what matters is the content of the desired record. Large amounts of information are usually subjected to non-numerical processing. In various applications, the following operations, for example, can be performed on such data:

  • increase the salary of all employees of the company;
  • calculate the bank interest on the accounts of all clients;
  • make changes to the list of all goods in stock;
  • find the required abstract from all texts stored in the library or in the bibliographic information retrieval system;
  • find a description of the required contract in a file containing legal documents;
  • look through all files containing patent descriptions and find a patent (if any) similar to the one proposed.

To implement database machines, parallel and associative architectures were developed as an alternative to the uniprocessor von Neumann structure, allowing work with large amounts of information in real time.

Database machines are gaining importance in connection with the research and application of artificial intelligence concepts such as knowledge representation, expert systems, inference, pattern recognition, etc.

Information storages (data warehouses). Today it is widely acknowledged that most companies operate several databases and that successful work with information requires not only different types of databases but also different generations of DBMS. According to statistics, each organization uses on average 2.5 different DBMSs. The need became obvious to "isolate" the business of companies, or rather the people involved in that business, from the technological features of the databases, and to give users a single view of corporate information regardless of where it is physically stored. This stimulated the emergence of data warehousing technology (Data Warehousing, DW).

The main goal of DW is the creation of a single logical representation of the data contained in different types of databases, or, in other words, a single corporate data model.

The new round of DW development became possible due to the improvement of information technologies in general, in particular the emergence of new types of databases based on parallel query processing, which in turn relied on advances in the field of parallel computers. Query builders with an intuitive graphical interface were created, making it easy to build complex queries to the database. Various middleware software provided connectivity between heterogeneous databases, and, finally, the cost of storage devices fell sharply.

A data bank may be present in the structure of a corporation.

A data bank is a functional and organizational component of automated control systems and information and computing systems that provides centralized information support for a group of users or for the set of tasks solved in the system.

A data bank is considered as an information and reference system, the main purpose of which is:

  • to accumulate and maintain in working order the set of information that constitutes the information base of the entire automated system or of a certain set of tasks solved in it;
  • to issue the data required by a task or a user;
  • to provide collective access to the stored information;
  • to ensure the necessary management of the use of the information contained in the information base.

Thus, a modern data bank is a complex software and hardware system, which includes technical, system and network facilities, databases and DBMSs, and information retrieval systems for various purposes.

V.2. DBMS AND STRUCTURAL SOLUTIONS IN CORPORATE SYSTEMS

Database and knowledge management systems

An important component of modern information systems are database management systems (DBMS).

A DBMS is a set of software and language tools intended for the creation, maintenance, and use of databases.

A database management system provides data processing systems with access to databases. As already noted, DBMSs play an important role in the creation of corporate information systems, and a particularly important role in the creation of information systems that use distributed information resources based on modern networked computer technologies.

The main feature of modern DBMSs is that they support technologies such as:

  • Client / server technology.
  • Support for database languages: a schema definition language (SDL, Schema Definition Language), a data manipulation language (DML), and integrated languages such as SQL (Structured Query Language), QBE (Query-By-Example), and QMF (Query Management Facility), an advanced query specification and reporting tool for DB2, etc.;
  • Direct data management in external memory.
  • Management of RAM buffers.
  • Transaction management: OLTP technology (On-Line Transaction Processing) and OLAP technology (On-Line Analytical Processing) for DW.
  • Ensure data protection and integrity. The use of the system is allowed only to users who have the right to access the data. When users perform operations on data, the consistency of the stored data (integrity) is maintained. This is important in corporate multi-user information systems.
  • Logging (journaling).

Modern DBMS must ensure compliance with the database requirements listed above. In addition, they must comply with the following principles:

  • Data independence.
  • Versatility. The DBMS must have powerful conceptual data model support for displaying custom logical views.
  • Compatibility. The DBMS must remain operational with the development of software and hardware.
  • Minimal data redundancy. Unlike file systems, a database must be a single collection of integrated data.
  • Data protection. The DBMS must provide protection against unauthorized access.
  • Data integrity. The DBMS must prevent users from breaking the database.
  • Management of simultaneous work. The DBMS must protect the database from inconsistencies in the shared access mode. To ensure a consistent state of the database, all user requests (transactions) must be executed in a specific order.
  • The DBMS must be universal. It should support different data models on a single logical and physical basis.
  • The DBMS must support both centralized and distributed databases and, thus, become an important link in computer networks.

Considering a DBMS as a class of software products focused on maintaining databases in automated systems, we can distinguish two most essential features that determine the types of DBMS. According to them, a DBMS can be viewed from two points of view:

  • their capabilities in relation to distributed (corporate) databases;
  • their relationship to the type of the data model implemented in the DBMS.

In relation to corporate (distributed) databases, the following types of DBMS can be conventionally distinguished:

  • "Desktop" DBMS. These products are primarily focused on working with personal data ("desktop" data). They have command sets for sharing common databases, but small in size (like a small office). First of all, it is a DBMS such as Assess, dBASE, Paradox, EohPgo. Why Assess, dBASE, Paradox, EohPgo have poor access to corporate data. The point is, there is no easy way to overcome the barrier between personal and corporate data. And the point is not even that the mechanism of the personal data DBMS (or small office) is focused on accessing data through many gateways, internetworking products, etc. The problem is that these mechanisms are usually associated with full file transfers and the lack of forked index support, which results in server queues practically stalling on large systems.
  • Specialized high-performance multi-user DBMS. Such DBMSs are characterized by the presence of a multiuser system kernel, a data manipulation language and the following functions typical for developed multiuser DBMSs:
  • organization of the buffer pool;
  • the presence of a system for processing queues of transactions;
  • the presence of mechanisms for multi-user data locking;
  • transaction logging;
  • the availability of access control mechanisms.

DBMSs such as Oracle, DB2, SQL Server, Informix, Sybase, ADABAS, Titanium and others provide a wide range of services for processing corporate databases.

When working with databases, the transaction mechanism is used.

A transaction is a logical unit of work.

A transaction is a sequence of data manipulation statements executed as a whole (all or nothing) that moves the database from one consistent state to another consistent state.

A transaction has four important properties known as the ACID properties:

  • (A) Atomicity. A transaction is executed as an atomic operation: either the entire transaction is executed, or none of it is.
  • (C) Consistency. A transaction moves the database from one consistent state to another consistent state; within a transaction, database consistency may temporarily be violated.
  • (I) Isolation. Transactions of different users must not interfere with each other (for example, they behave as if they were executed strictly one after another).
  • (D) Durability. If a transaction has been completed, the results of its work must be preserved in the database even if the system crashes the next moment.

A transaction usually starts automatically from the moment the user connects to the DBMS and continues until one of the following events occurs (a sketch follows the list):

  • The COMMIT WORK command is issued.
  • The ROLLBACK WORK command is issued.
  • The user disconnects from the DBMS.
  • A system failure occurs.
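A minimal sketch of this behaviour, using Python's built-in sqlite3 module as a stand-in for a corporate DBMS (the accounts table and the amounts are invented for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
    conn.commit()

    try:
        # Both updates form one logical unit of work: a transfer of 30 units.
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        conn.commit()      # analogous to COMMIT WORK
    except sqlite3.Error:
        conn.rollback()    # analogous to ROLLBACK WORK: all or nothing

    print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())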

For the user, a transaction is usually atomic in nature, although in fact it is a complex mechanism of interaction between the user (application) and the database. Enterprise software built around on-line transaction processing (On-Line Transaction Processing, OLTP), in particular accounting programs, software for receiving and processing client orders, and financial applications, produces a lot of information. These systems are designed (and optimized accordingly) to handle large amounts of data, complex transactions, and intensive read/write operations.

Unfortunately, the information placed in the databases of OLTP systems is not very suitable for use by ordinary users (due to the high degree of table normalization, specific data presentation formats, and other factors). Therefore, data from different information "pipelines" is sent (that is, copied) to a warehouse for storage, sorting and subsequent delivery to the consumer. In information technology, the role of such warehouses is played by information storages (data warehouses).

Delivery of information to the end user is handled by on-line analytical processing systems (On-Line Analytical Processing, OLAP), which provide extremely easy access to data through convenient means of generating queries and analyzing results. In OLAP systems, the value of the information product is increased through the use of various methods of analysis and statistical processing. In addition, these systems are optimized for the speed of data extraction and the collection of aggregated information, and they are aimed at ordinary users (they have an intuitive interface). If an OLTP system answers simple questions such as "what was the level of sales of product N in region M in January 199x?", then OLAP systems are ready for more complex user requests, for example: "Provide an analysis of sales of product N in all regions according to the plan for the second quarter in comparison with the two previous years."
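The contrast can be sketched with two queries against a hypothetical sales table (the table, columns, and figures are invented; sqlite3 is used only to keep the example self-contained):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, region TEXT, year INTEGER, month INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
        ("N", "M", 1998, 1, 120.0), ("N", "M", 1998, 2, 90.0),
        ("N", "K", 1998, 1, 200.0), ("N", "K", 1997, 1, 150.0),
    ])

    # OLTP-style point query: sales of product N in region M in January 1998.
    oltp = conn.execute(
        "SELECT SUM(amount) FROM sales "
        "WHERE product = 'N' AND region = 'M' AND year = 1998 AND month = 1").fetchone()

    # OLAP-style analytical query: sales of product N by region and year.
    olap = conn.execute(
        "SELECT region, year, SUM(amount) FROM sales "
        "WHERE product = 'N' GROUP BY region, year ORDER BY region, year").fetchall()

    print(oltp, olap)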

Client / server architecture

In modern distributed information processing systems, client/server technology takes center stage. In a client-server architecture, data processing is divided between the client computer and the server computer, which communicate over a network. This division of data processing is based on a grouping of functions. Typically, a database server computer is dedicated to performing database operations, while application programs run on the client computer. Figure 2.1 shows a simple client-server system that includes a computer acting as the server and another computer acting as its client. Each machine performs different functions and has its own resources.

Fig. 2.1. A client-server architecture system: a database on the server computer, connected over the network to IBM-compatible client PCs running the applications.

The main function of the client computer is to execute the application (user interface and presentation logic) and communicate with the server when required by the application.

A server is an object (computer) that provides services to other objects at their request.

As follows from the term itself, the main function of the server computer is to serve the needs of the client. The term "server" is used to denote two different groups of functions: a file server and a database server (hereinafter these terms mean, depending on the context, either the software that implements the specified groups of functions or the computers running this software). File servers are not designed to perform database operations; their main function is to share files between multiple users, that is, to provide many users with simultaneous access to files on the file server computer. An example of a file server is Novell's NetWare operating system. A database server can be installed and operated on a file server computer: the Oracle DBMS in the form of an NLM (NetWare Loadable Module) runs in the NetWare environment on the file server.

The local network server must have the resources appropriate to its functional purpose and the needs of the network. Note that in connection with the focus on the open systems approach, it is more correct to talk about logical servers (meaning a set of resources and software that provide services over these resources), which are not necessarily located on different computers. A feature of a logical server in an open system is that if, for reasons of efficiency, it is advisable to move the server to a separate computer, then this can be done without the need for any modification, both of itself and of the applications that use it.

One of the important server requirements is that the operating system hosting the database server must be multitasking (and preferably, but not necessarily multiuser). For example, an Oracle DBMS installed on a personal computer with an MS-DOS (or PC-DOS) operating system that does not meet the multitasking requirement cannot be used as a database server. And the same Oracle database installed on a computer with a multitasking (though not multiuser) OS / 2 operating system can be a database server. Many flavors of UNIX, MVS, VM and some other operating systems are both multitasking and multi-user.

Distributed computing

The term "distributed computing" is often used to refer to two different, albeit complementary, concepts:

  • Distributed database;
  • Distributed data processing.

The application of these concepts makes it possible to organize access to information stored on multiple machines for end users using different means.

There are many types of servers:

  • Database server;
  • Print server;
  • Remote access server;
  • Fax server;
  • Web server, etc.

Client/server technology rests on such basic technologies as:

  • Operating system technologies, the concept of interaction of open systems, the creation of object-oriented environments for the functioning of programs;
  • Telecommunication technologies;
  • Network technologies;
  • Graphical User Interface Technologies ( GUI);
  • Etc.

Advantages of client-server technology:

  • Client / server technology allows computing on heterogeneous computing environments. Platform independence: Access to heterogeneous network environments that include different types of computers with different operating systems.
  • Independence from data sources: access to information from heterogeneous databases. Examples of such systems are DB2, SQL / DS, Oracle, Sybase.
  • Load balance between client and server.
  • Perform computation where it is most efficient;
  • Provide the ability to efficiently scale;
  • Cross-platform computing. Cross-platform computing is simply defined as the implementation of technologies in heterogeneous computing environments. The following possibilities should be provided here:
  • The application must run on multiple platforms;
  • On all platforms, it must have the same interface and logic of work;
  • The application must integrate with the native operating environment;
  • It should behave the same on all platforms;
  • Simple and consistent support should be provided for it.

Distributed computing. Distributed computing involves the distribution of work among several computers (although distributed computing is a broader concept).

Downsizing. Downsizing is the porting of mainframe applications to small computer platforms.

  • Reduced infrastructure and hardware costs. Cost-effectiveness: the availability of low-cost computing equipment and the increasing proliferation of local area networks make client-server technology more cost-effective than other data processing technologies. The equipment can be upgraded as soon as the need arises.
  • Reduced overall application execution time;
  • Reduced client memory usage;
  • Reduced network traffic.

  • Ability to work with multimedia: to date, many multimedia programs have been created for the PC. There are either no such programs for the terminal-host configuration, or they are very expensive.
  • The ability to attract large computing resources for database operations: since applications are executed on client computers, additional (compared to the terminal-host configuration) resources are freed up on the server computer for database operations, such as computing resources of the central processor and operational memory.
  • Better programmer productivity: programmer productivity is increased by using tools such as SQL*Forms and CASE, which allow applications to be developed faster than with programming languages such as C, PL/1, or COBOL.
  • Increased End User Productivity: By now, many end users have mastered systems such as Lotus, Paradox, Word Perfect, Harvard Graphics, and more.

The server-side interface is defined and fixed. Therefore, it is possible to create new client parts of an existing system (an example of interoperability at the system level).

Fig. 2.2. Illustration of client access to a server share.

How to implement client-server technology

The installation of a system based on client-server technology and capable of performing distributed data processing is discussed below. The following computer hardware and software is required:

  • database server computer;
  • client computers;
  • communication network;
  • network software;
  • application software.

The SQL language. The high-level query language SQL (Structured Query Language) serves to express database queries, covering data definition, data manipulation, and data control, and has been adopted as a standard. SQL was originally adopted as the data language of IBM's software products and of IBM's relational DBMS SYSTEM R. An important feature of SQL is that the same language is available through two different interfaces: an interactive interface and an application programming interface (dynamic SQL). Dynamic SQL consists of many built-in SQL language features provided specifically for constructing interactive applications, where an interactive application is understood as a program written to support database access by an end user working at an interactive terminal. SQL provides the functions of defining, manipulating, and controlling database data and is transparent to the user with respect to the underlying DBMS.
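A minimal sketch of the application programming (dynamic SQL) side, using Python's DB-API with sqlite3 as a stand-in for a full SQL DBMS (the goods table is invented for the example); the statement text is supplied at run time with a bound parameter:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Data definition
    conn.execute("CREATE TABLE goods (name TEXT PRIMARY KEY, price REAL)")
    # Data manipulation
    conn.executemany("INSERT INTO goods VALUES (?, ?)",
                     [("tea", 2.5), ("salt", 0.8), ("silk", 40.0)])

    # Dynamic SQL: the query is passed as text at run time,
    # with a parameter bound by the application.
    threshold = 1.0
    rows = conn.execute(
        "SELECT name, price FROM goods WHERE price > ? ORDER BY price",
        (threshold,)).fetchall()
    print(rows)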

Fig. 2.3. Scheme for executing user queries to distributed databases.

The internal structure of the databases is determined by the data models used. The conceptual model has more abstraction capabilities and richer semantics than external models. External models are often referred to as syntactic or operational models, referring to the syntactic nature of control and use as a means of user interaction with the database. In Information Modeling, there are different levels of abstraction, from the conceptual model to the physical data model, that affect the architecture of the DBMS.

The data model has three components:

  • The data structure that represents the database from the user's point of view.
  • The valid operations that can be performed on the data structure. It must be possible to work with this structure using data definition and data manipulation operations; a rich structure is worthless if there is no way to manipulate its contents.
  • Integrity constraints. The data model should be provided with means of maintaining its integrity and protecting it. As an example, consider the following two constraints (a sketch follows the list):
  • Each subtree must have a source node. Hierarchical databases cannot store child nodes without a source node.
  • With respect to a relational database, there cannot be identical tuples. For a file, this means that all records must be unique.
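A hedged sketch of such declarative constraints (sqlite3 syntax; the tables and columns are invented for the example): the primary key rules out identical tuples, and the foreign key rules out a child record without its source record:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity in SQLite

    conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("""CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,          -- no two identical key values
        dept_id INTEGER NOT NULL REFERENCES department(dept_id))""")

    conn.execute("INSERT INTO department VALUES (10, 'Sales')")
    conn.execute("INSERT INTO employee VALUES (1, 10)")

    try:
        # Violates referential integrity: department 99 does not exist.
        conn.execute("INSERT INTO employee VALUES (2, 99)")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)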

One of the most important characteristics of a DBMS is the ability to link objects.

There are the following types of links between objects (a schema sketch follows the list):

  • One-to-one (1:1). One object of one set can be associated with one object of another set.
  • One-to-many (1:M). One object of one set can be associated with many objects of another set.
  • Many-to-many (M:N). One object of one set can be associated with many objects of another set, while one object of the other set can in turn be associated with many objects of the first set.
  • Ramified. One object of one set can be associated with objects of many sets.
  • Recursive. One object of a given set can be associated with another object of the same set.
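A hedged sketch of how these link types might look in a relational schema (all table names are invented; the ramified case is omitted for brevity): one-to-one via a shared key, one-to-many and recursive links via foreign keys, many-to-many via an associative table:

    schema = """
    -- One-to-one (1:1): passport shares the person's key.
    CREATE TABLE person   (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE passport (person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
                           number TEXT);

    -- One-to-many (1:M): many orders belong to one person.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         person_id INTEGER REFERENCES person(person_id));

    -- Many-to-many (M:N): resolved through an associative table.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE order_item (order_id INTEGER REFERENCES orders(order_id),
                             product_id INTEGER REFERENCES product(product_id),
                             PRIMARY KEY (order_id, product_id));

    -- Recursive: an employee may reference another row of the same table.
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY,
                           manager_id INTEGER REFERENCES employee(emp_id));
    """

    import sqlite3
    sqlite3.connect(":memory:").executescript(schema)  # the DDL is syntactically valid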

The following basic data models exist:

  • Relational data model.
  • Hierarchical data model.
  • Incomplete network data model.
  • CODASYL data model.
  • Extended network data model.

V.3. INTERNET/INTRANET TECHNOLOGIES AND CORPORATE DATABASE ACCESS SOLUTIONS

The main problem of systems based on the client-server architecture is that, in accordance with the concept of open systems, they are required to be mobile in the widest possible class of hardware and software solutions of open systems. Even if we restrict ourselves to UNIX-based local area networks, different networks use different equipment and communication protocols. Attempts to create systems that support all possible protocols lead to their overload with network details to the detriment of functionality.

An even more complex aspect of this problem is associated with the possibility of using different representations of data in different nodes of a heterogeneous local network. Different computers may have different addressing, number representation, character encoding, etc. This is especially important for high-level servers: telecommunications, computing, databases.

A common solution to the problem of mobility in systems based on a client-server architecture is to rely on software packages that implement Remote Procedure Call (RPC) protocols. With these tools, a call to a service at a remote site looks like a normal procedure call. RPC tools, which naturally contain all the information about the specifics of the local network hardware and network protocols, translate the call into a sequence of network interactions. Thus, the specifics of the network environment and protocols are hidden from the application programmer.

When a remote procedure is called, RPC programs convert the client's data formats into intermediate, machine-independent formats and then into the server's data formats. Similar transformations are performed when the response parameters are passed back.
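A minimal sketch of the idea of converting call parameters into a machine-independent intermediate format (JSON with a length prefix is used here purely for illustration; real RPC stacks use their own encodings such as XDR):

    import json
    import struct

    def marshal_call(procedure: str, args: dict) -> bytes:
        """Encode a call into a machine-independent byte string:
        a big-endian length prefix followed by UTF-8 JSON."""
        payload = json.dumps({"procedure": procedure, "args": args}).encode("utf-8")
        return struct.pack(">I", len(payload)) + payload

    def unmarshal_call(message: bytes) -> dict:
        """Decode the byte string back into native (server-side) data structures."""
        (length,) = struct.unpack(">I", message[:4])
        return json.loads(message[4:4 + length].decode("utf-8"))

    # The client marshals the call; the server unmarshals it into its own formats.
    wire = marshal_call("get_balance", {"account_id": 42})
    print(unmarshal_call(wire))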

The purpose of the lecture

After studying the material of this lecture, you will know:

  • what an enterprise data model is;
  • how to convert an enterprise data model into a data warehouse model;
  • the main elements of a corporate data model;
  • the presentation layers of a corporate data model;
  • an algorithm for transforming an enterprise data model into a multidimensional data warehouse model;

and learn to:

  • develop data warehouse models based on an organization's corporate data model;
  • design a star schema using CASE tools;
  • partition the tables of a multidimensional model using CASE tools.

Enterprise data model

Introduction

The core of any data warehouse (DW) is its data model. Without a data model it will be very difficult to organize the data in the DW. Therefore, DW developers should spend time and effort on developing such a model. The development of the DW model falls on the shoulders of the DW designer.

Compared with the design of OLTP systems, the DW design methodology has a number of distinctive features associated with the orientation of the warehouse data structures toward solving problems of analysis and information support of the decision-making process. The DW data model should provide an effective solution to precisely these problems.

The starting point in the design of a DW can be the so-called enterprise data model (corporate data model, EDM), which is created in the process of designing an organization's OLTP systems. When designing a corporate data model, an attempt is usually made to create, on the basis of business operations, a data structure that would collect and synthesize all the information needs of the organization.

Thus, the enterprise data model contains the information necessary to build a DW model. Therefore, at the first stage, if such a model exists in the organization, the DW designer can begin the DW design by solving the problem of transforming the corporate data model into the DW model.

Enterprise data model

How is the problem of transforming the corporate data model into the DW model solved? To solve it, you first need to have such a model, i.e. the corporate data model must be built and documented. You also need to understand what from this model should be transformed into the DW model, and how.

Let us clarify, from the standpoint of a DW designer, the concept of a corporate data model. A corporate data model is understood as a multi-level, structured description of an organization's subject areas, subject-area data structures, business processes and business procedures, organizational data flows, state diagrams, data-process matrices, and other model representations used in the organization's activities. Thus, in the broadest sense of the word, an enterprise data model is a set of models of various levels that characterize (model at some abstract level) the activities of an organization; the content of the corporate model depends directly on which model constructs have been included in it in a given organization.

The main elements of a corporate data model are:

  • description of the subject areas of the organization (definition of areas of activity);
  • relationships between the subject areas defined above;
  • information data model (ERD model, or entity-relationship model);
  • description for each subject area:
    • entity keys;
    • entity attributes;
    • subtypes and supertypes;
    • relationships between entities;
    • grouping attributes;
    • relationships between subject areas;
  • functional or business process model;
  • data flow diagrams;
  • state diagrams;
  • other models.

Thus, the enterprise data model contains the entities, attributes, and relationships that represent the information needs of an organization. Fig. 16.1 shows the main elements of a corporate data model.

Presentation levels of the enterprise data model

The Enterprise Data Model is subdivided according to subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions such as contract management, while others may include entities that describe products or services.

Each logical model must correspond to an existing subject area of the corporate data model. If the logical model does not meet this requirement, a subject-area model must be added to it.

An enterprise data model usually has several levels of presentation. The high-level corporate data model is, in essence, a description of the organization's main subject areas and their relationships at the entity level. Fig. 16.2 shows a fragment of a top-level corporate data model.

Fig. 16.2.

The diagram shown in the figure presents four subject areas: "Buyer" (Customer), "Invoice" (Account), "Order" (Order) and "Product" (Product). As a rule, only direct connections between subject areas are shown, which, for example, record the following fact: the buyer pays the invoice for an order of goods. Details and indirect relationships are not shown at this level of the corporate model.

At the next, middle level of the corporate data model, detailed information about the objects of the subject areas is shown, i.e. entity keys and attributes, their relationships, subtypes and supertypes, and so on. For each subject area of the top-level model there is one middle-level model. Fig. 16.3 shows the middle level of presentation of the corporate model for a fragment of the "Order" subject area.

Fig. 16.3 shows that the "Order" subject area includes several entities, defined through their attributes, and the relationships between them. The presented model makes it possible to answer questions such as the date of an order, who placed the order, who sent it, who receives it, and a number of others. The diagram also shows that in this organization there are two types of orders: commercial orders (Commercial) and retail orders (Retail).
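A hedged sketch of how such a supertype with two subtypes might be expressed in relational DDL (the names below are invented for the example, not taken from the figure):

    order_schema = """
    CREATE TABLE "order" (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT NOT NULL,
        order_type TEXT NOT NULL CHECK (order_type IN ('COMMERCIAL', 'RETAIL'))
    );

    -- Each subtype table shares the supertype's key (one row per order of that type).
    CREATE TABLE commercial_order (
        order_id     INTEGER PRIMARY KEY REFERENCES "order"(order_id),
        contract_ref TEXT
    );

    CREATE TABLE retail_order (
        order_id    INTEGER PRIMARY KEY REFERENCES "order"(order_id),
        outlet_name TEXT
    );
    """

    import sqlite3
    sqlite3.connect(":memory:").executescript(order_schema)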

Note that an enterprise data model can represent various aspects of the organization's activities, with varying degrees of detail and completeness. If the corporate model represents all aspects of the organization's activities, it is also called an organization data model (enterprise data model).

From the point of view of DW design, an important factor in deciding to build a DW model from the corporate data model is the degree of completeness of the corporate data model.

The corporate data model of an organization evolves, i.e. it is constantly being developed and improved. Some subject areas of the corporate data model may be well developed, while for others the work may not have started yet. If a fragment of a subject area has not been worked out in the corporate data model, there is no way to use that model as a starting point for DW design.

The degree of completeness of the corporate model can be compensated for in DW design as follows: since the DW development process is usually divided in time into a sequence of stages, its design can be synchronized with the process of completing the development of individual fragments of the organization's corporate data model.

At the lowest presentation layer of the corporate data model, information is shown about the physical characteristics of the database objects corresponding to the logical data model of the middle presentation layer of the corporate data model.

Zaitsev S.L., Ph.D.

Repeating groups

Repeating groups are attributes for which a single instance of an entity can have more than one value. For example, a person can have more than one skill. If, in terms of business requirements, we need to know the skill level for each skill, and each person can have only two skills, we can create the entity shown in Fig. 1.6: the entity PERSON with two attributes for storing skills and the skill level of each.

Fig. 1.6. This example uses repeating groups.

The problem with repeating groups is that we cannot know exactly how many skills a person might have. In real life, some people have one skill, some have several, and some have none yet. Figure 1.7 shows the model reduced to first normal form. Note the added Skill ID, which uniquely identifies each SKILL.

Fig. 1.7. Model reduced to first normal form.
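A hedged sketch of the same reduction in relational terms (the column names are invented to match the discussion, not copied from the figure): the repeating skill attributes are replaced by one row per person-skill pair, keyed by the added Skill ID:

    first_nf = """
    -- Before 1NF: skill_1, level_1, skill_2, level_2 repeated inside PERSON.
    -- After 1NF: one row per (person, skill) with its own identifying key.
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT
    );

    CREATE TABLE person_skill (
        person_id   INTEGER REFERENCES person(person_id),
        skill_id    INTEGER,
        skill_name  TEXT,
        skill_level INTEGER,
        PRIMARY KEY (person_id, skill_id)
    );
    """

    import sqlite3
    sqlite3.connect(":memory:").executescript(first_nf)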

One fact in one place

If the same attribute is present in more than one entity and is not a foreign key, then this attribute is considered redundant. The logical model should not contain redundant data.

Redundancy requires additional space, but while memory efficiency is important, the real problem lies elsewhere. Ensuring that redundant data is synchronized is overhead, and you always run the risk of conflicting values.

In the previous example, SKILL depends on both Person ID and Skill ID. This means that a SKILL will not exist until a PERSON possessing that skill appears. It also makes it difficult to change a Skill Name: you have to find every entry with that Skill Name and change it for every Person who has the skill.

Figure 1.8 shows the model in second normal form. Note that the SKILL entity has been added and that the Skill Name attribute has been moved into it. The skill level remains, accordingly, at the intersection of PERSON and SKILL.

Fig. 1.8. In second normal form, the repeating group is moved to another entity. This provides the flexibility to add the required number of Skills and change the Skill Name or Skill Description in one place.
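Continuing the sketch under the same assumptions, second normal form moves the attribute that depends on only part of the key (the skill name) into its own SKILL entity, leaving the skill level at the intersection:

    second_nf = """
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);

    CREATE TABLE skill (
        skill_id   INTEGER PRIMARY KEY,
        skill_name TEXT            -- now stored in exactly one place
    );

    CREATE TABLE person_skill (
        person_id   INTEGER REFERENCES person(person_id),
        skill_id    INTEGER REFERENCES skill(skill_id),
        skill_level INTEGER,       -- depends on the whole key (person, skill)
        PRIMARY KEY (person_id, skill_id)
    );
    """

    import sqlite3
    sqlite3.connect(":memory:").executescript(second_nf)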

Each attribute depends on the key

Each attribute of an entity must depend on the primary key of that entity. In the previous example, School Name and Geographic Area are present in the PERSON table but do not describe the person. To achieve third normal form, these attributes must be moved to an entity where they will depend on the key. Figure 1.9 shows the model in third normal form.

Fig. 1.9. In third normal form, School Name and Geographic Region are moved to an entity where their values depend on the key.

Many-to-many relationships

Many-to-many relationships reflect the reality of the surrounding world. Note that in Figure 1.9 there is a many-to-many relationship between PERSON and SCHOOL. The relationship accurately reflects the fact that a PERSON can study in many SCHOOLS and that many PERSONS can study in a SCHOOL. To achieve fourth normal form, an associative entity is created that eliminates the many-to-many relationship by generating a separate entry for each unique combination of school and person. Figure 1.10 shows the model in fourth normal form.

Fig. 1.10. In fourth normal form, the many-to-many relationship between PERSON and SCHOOL is resolved by introducing an associative entity in which a separate entry is allocated for each unique combination of SCHOOL and PERSON.
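A hedged sketch of the associative entity under the same assumptions (the table names are invented for the example):

    fourth_nf = """
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE school (school_id INTEGER PRIMARY KEY, school_name TEXT,
                         geographic_region TEXT);

    -- Associative entity: one row per unique (person, school) combination.
    CREATE TABLE person_school (
        person_id INTEGER REFERENCES person(person_id),
        school_id INTEGER REFERENCES school(school_id),
        PRIMARY KEY (person_id, school_id)
    );
    """

    import sqlite3
    sqlite3.connect(":memory:").executescript(fourth_nf)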

Formal definitions of normal forms

The following definitions of normal forms can seem daunting. Think of them simply as formulas for achieving normalization. Normal forms are based on relational algebra and can be interpreted as mathematical transformations. Although this book is not devoted to a detailed discussion of normal forms, modelers are encouraged to take a deeper look at the subject.

In a given relation R, attribute Y is functionally dependent on attribute X (in symbolic form R.X -> R.Y, read "R.X functionally determines R.Y") if and only if each value of X in R is associated with exactly one value of Y in R (at any given time). Attributes X and Y may be composite (Date C.J. An Introduction to Database Systems, 6th edition. Williams, 1999, 848 pp.).

The relation R corresponds to the first normal form (1NF) if and only if all domains belonging to it contain only atomic values ​​(Date, ibid.).

A relation R corresponds to second normal form (2NF) if and only if it corresponds to 1NF, and each non-key attribute is completely dependent on the primary key (Date, ibid.).

A relation R corresponds to third normal form (3NF) if and only if it corresponds to 2NF, and each non-key attribute does not transitively depend on the primary key (Date, ibid.).

A relation R is in Boyce-Codd normal form (BCNF) if and only if every determinant is a candidate key.
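These conditions can be restated compactly (a sketch in LaTeX notation that adds nothing beyond Date's definitions above):

    % Functional dependency and the inclusion of the normal-form classes
    \[ X \to Y \iff \forall t_1, t_2 \in R:\; t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y] \]
    \[ \mathrm{BCNF} \subset \mathrm{3NF} \subset \mathrm{2NF} \subset \mathrm{1NF} \]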

NOTE Below is a brief explanation of some of the abbreviations used in Date's definitions.

MVD (multi-valued dependency) is a multi-valued dependency. Used only for entities with three or more attributes. In a multivalued dependency, the value of the attribute depends only on a portion of the primary key.

FD (functional dependency) - functional dependency. With functional dependency, the value of an attribute depends on the value of another attribute that is not part of the primary key.

JD (join dependency) is a join dependency. With a union dependency, the primary key of the parent entity is traced back to at least the third level descendants, while retaining the ability to be used in the union by the original key.

A relation R is in fourth normal form (4NF) if and only if, whenever there is an MVD in R, say A →→ B, all attributes of R also functionally depend on A. In other words, the only dependencies (FD or MVD) in R are of the form K → X (i.e., a functional dependency of an attribute X on a candidate key K). Equivalently, R meets the requirements of 4NF if it is in BCNF and all MVDs are in fact FDs (Date, ibid.).

For fifth normal form, a relation R satisfies the join dependency (JD) *(X, Y, …, Z) if and only if R is equivalent to the join of its projections onto X, Y, …, Z, where X, Y, …, Z are subsets of the set of attributes of R.

There are many other normal forms for complex data types and specific situations that are beyond the scope of this discussion. Any model development enthusiast would like to learn other normal forms as well.

Business normal forms

In his book (An Introduction to Information Engineering: From Strategic Planning to Information Systems. Reading, Massachusetts: Addison-Wesley, 1989), Clive Finklestein took a different approach to normalization. He defines business normal forms in terms of the steps of reduction to those forms. Many modelers find this approach more intuitive and more pragmatic.

The first business normal form (1BNF) takes out repeating groups to another entity. This entity gets its own name and primary (composite) key attributes from the original entity and its repeating group.

The second business normal form (2BNF) takes out attributes that are partially dependent on the primary key to another entity. The primary (composite) key of this entity is the primary key of the entity in which it was originally located, along with additional keys on which the attribute depends entirely.

The third business normal form (3BNF) takes attributes that are independent of a primary key into another entity, where they are completely dependent on the primary key of that entity.

The fourth business normal form (4BNF) takes attributes that depend on the value of the primary key or are optional to a secondary entity, where they depend entirely on the value of the primary key, or where they must (necessarily) be present in that entity.

The fifth business normal form (5BNF) appears as a structural entity if there is a recursive or other dependency between instances of a secondary entity, or if a recursive dependency exists between instances of its primary entity.

Completed Logical Data Model

The completed logical model must satisfy the requirements of the third business normal form and include all the entities, attributes, and relationships necessary to support the data requirements and business rules associated with the data.

All entities must have names that describe their content and have a clear, concise, complete description or definition. A future post will cover an initial set of guidelines for the correct formation of entity names and descriptions.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its meaning, a logical data type, and a clear, short, complete description or definition. In a future blog post, we will look at an initial set of guidelines for properly forming attribute names and descriptions.

Relationships should include a verb construction that describes the relationship between entities, along with characteristics such as plurality, necessity of existence or possibility of absence of a relationship.

NOTE The plurality of a relationship describes the maximum number of secondary-entity instances that can be associated with an instance of the original entity. The necessity of existence or possibility of absence of a relationship is used to determine the minimum number of secondary-entity instances that can be associated with an instance of the original entity.

Physical data model

Once you have created a complete and adequate logical model, you are ready to make the decision to choose an implementation platform. The choice of platform depends on the requirements for the use of data and the strategic principles of shaping the architecture of the corporation. Choosing a platform is a complex issue beyond the scope of this book.

In ERwin, a physical model is a graphical representation of a real-world database. The physical database will be made up of tables, columns, and relationships. The physical model depends on the platform chosen for implementation and the requirements for using the data. The physical model for IMS will be very different from that for Sybase. The physical model for OLAP reports will look different from the model for OLTP (online transaction processing).

The data modeler and database administrator (DBA) use the logical model, usage requirements, and corporate architecture policy to develop a physical data model. You can denormalize the physical model to improve performance, and create views to support usage requirements. The following sections detail the process of denormalizing and creating views.
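A hedged sketch of the two techniques just mentioned, denormalization and views, with invented table names and sqlite3 standing in for the target platform:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           amount REAL);

    -- Denormalization: region is copied into orders to avoid a join in hot queries
    -- (at the cost of keeping the copy synchronized).
    ALTER TABLE orders ADD COLUMN customer_region TEXT;

    -- A view supporting a reporting usage requirement without duplicating data.
    CREATE VIEW sales_by_region AS
        SELECT c.region, SUM(o.amount) AS total_amount
        FROM customer c JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.region;
    """)
    print([row[1] for row in conn.execute("PRAGMA table_info(orders)")])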

This section provides an overview of the process of building a physical model, collecting data usage requirements, defining the components of a physical model, and providing reverse engineering. In the following publications, these issues are covered in more detail.

Collecting data usage requirements

You typically collect data usage requirements early on during interviews and work sessions. At the same time, the requirements should determine as fully as possible the use of data by the user. The superficial attitude and gaps in the physical model can lead to unplanned costs and delay in project implementation. Requirements for use include:

    Access and performance requirements

    Volumetric characteristics (an estimate of the amount of data to be stored) that allow the administrator to represent the physical volume of the database

    Estimating the number of users who need concurrent access to data to help you design your database for acceptable performance levels

    Aggregates, pivots, and other calculated or derived data that can be considered candidates for storage in persistent data structures

    Requirements for reporting and standard queries to help the database administrator build indexes

    Views (persistent or virtual) that will assist the user when performing data aggregation or filtering operations.

In addition to the chairman, secretary, and users, the modeler, database administrator, and database architect must participate in the usage requirements session. The user's historical data requirements should be discussed. The length of time that data is retained has a significant impact on the size of the database. Often, older data is stored in a generalized form, and atomic data is archived or deleted.

Users should bring examples of queries and reports with them to the session. Reports must be strictly defined and must include the atomic values used for any summary and total fields.

Physical data model components

The components of a physical data model are tables, columns, and relationships. Entities of the logical model are likely to become tables in the physical model. Attributes of the logical model become columns. Logical relationships become referential-integrity constraints. Some logical relationships cannot be implemented in a physical database.
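A hedged sketch of that mapping (entity to table, attribute to column, relationship to a referential-integrity constraint), with names invented for the example:

    physical_ddl = """
    CREATE TABLE customer (              -- entity        -> table
        customer_id INTEGER PRIMARY KEY, -- key attribute -> key column
        name        TEXT NOT NULL        -- attribute     -> column
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customer(customer_id)  -- relationship -> integrity constraint
    );
    """

    import sqlite3
    sqlite3.connect(":memory:").executescript(physical_ddl)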

Reverse engineering

When a logical model is not available, it becomes necessary to recreate the model from the existing database. In ERwin, this process is called reverse engineering. Reverse engineering can be done in several ways. The modeler can explore the data structures in the database and recreate the tables in a visual modeling environment. You can import data definition language (DDL) into a tool that supports reverse engineering (such as ERwin). Advanced tools such as ERwin include functions that provide ODBC communication with an existing database in order to create a model by directly reading its data structures. Reverse engineering with ERwin will be discussed in detail in a future post.
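The general idea of reading data structures directly from a database can be sketched as follows (sqlite3's catalog is used purely to illustrate the principle; this is not how ERwin's ODBC-based reverse engineering works internally):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(customer_id));
    """)

    # Read the catalog to recover tables, columns and relationships.
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        columns = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
        fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
        print(table, columns, [(fk[2], fk[3], fk[4]) for fk in fks])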

Using corporate functional boundaries

When building a logical model for a modeler, it is important to ensure that the new model is consistent with the corporate model. Using corporate functional boundaries means modeling data in terms used within a corporation. The way data is used in a corporation is changing faster than the data itself. In each logical model, data must be presented in a holistic manner, regardless of the business domain it supports. Entities, attributes, and relationships must define business rules at the corporation level.

NOTE Some of my colleagues refer to these corporate functional boundaries as real-world modeling. Real-world modeling encourages the modeler to view information in terms of its inherent relationships and interconnections.

The use of corporate functional boundaries for a data model that is constructed appropriately provides the basis for supporting the information needs of any number of processes and applications, which enables the corporation to more efficiently exploit one of its most valuable assets - information.

What is an Enterprise Data Model?

Enterprise data model (EDM) contains entities, attributes, and relationships that represent the information needs of a corporation. EDM is usually categorized according to subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions such as contract management, while others may include entities that describe products or services.

Each logical model must correspond to the existing domain of the corporate data model. If the logical model does not meet this requirement, a domain model must be added to it. This comparison ensures that the corporate model is improved or adjusted and that all logical modeling efforts are coordinated within the corporation.

EDM also includes specific entities that define the scope of values for key attributes. These entities have no parents and are defined as independent. Independent entities are often used to maintain the integrity of relationships. These entities are identified by several different names, such as code tables, reference tables, type tables, or classification tables. We will use the term "corporate business object". A corporate business object is an entity that contains a set of attribute values that are independent of any other entity. Corporate business objects should be used consistently within a corporation.
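
A hedged sketch of such an independent entity, with hypothetical names: a classification (code) table that has no parents and is referenced by other tables to maintain the integrity of relationships.

    -- An independent "corporate business object": a classification table
    CREATE TABLE contract_type (
        contract_type_code CHAR(3)      NOT NULL PRIMARY KEY,
        description        VARCHAR(100) NOT NULL
    );

    -- Other entities reference it, which keeps attribute values consistent
    CREATE TABLE contract (
        contract_id        INTEGER NOT NULL PRIMARY KEY,
        contract_type_code CHAR(3) NOT NULL
            REFERENCES contract_type (contract_type_code)
    );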

Building a corporate data model by augmenting

There are organizations where the corporate model has been built from start to finish as a result of a single concerted effort. On the other hand, most organizations build fairly complete corporate models by scaling up.

Building up means building something sequentially, layer by layer, just like an oyster grows a pearl. Each data model created provides a contribution to the formation of the EDM. Building an EDM in this way requires additional modeling steps to add new data structures and domains or augment existing data structures. This makes it possible to build an enterprise data model by augmenting, iteratively adding levels of detail and refinement.

Modeling methodology concept

There are several visual data modeling methodologies. ERwin supports two:

    IDEF1X (Integration Definition for Information Modeling - an integrated description of information models).

    IE (Information Engineering).

IDEF1X is a good methodology, and the use of its notation is widespread.

Integrated description of information models

IDEF1X is a highly structured data modeling methodology that extends the IDEF1 methodology adopted as a FIPS (Federal Information Processing Standards) standard. IDEF1X uses a highly structured set of modeling construct types and results in a data model that requires an understanding of the physical nature of the data before such information can be made available.

The rigid structure of IDEF1X forces the modeler to assign characteristics to entities that may not correspond to the realities of the surrounding world. For example, IDEF1X requires all entity subtypes to be exclusive. As a result, a person cannot be both a client and an employee, although real practice tells us otherwise.

Information engineering

Clive Finklestein is often referred to as the father of information engineering, although similar concepts were shared with him by James Martin (Martin, James. Managing the Database Environment. Upper Saddle River, New Jersey: Prentice Hall, 1983.). Information Engineering uses a business-driven approach to information management and uses a different notation to represent business rules. IE serves as an extension and development of the notation and core concepts of the ER methodology proposed by Peter Chen.

IE provides the infrastructure to support information requirements by integrating corporate strategic planning with information systems that are being developed. This integration allows the management of information resources to be more closely aligned with the long-term strategic prospects of the corporation. This business-driven approach has led many modelers to choose IE over other methodologies that tend to focus on short-term development challenges.

IE proposes a sequence of actions that leads a corporation to identify all of its information needs for collecting and managing data and identifying relationships between information objects. As a result, information requirements are clearly articulated based on management directives and can be directly translated into a management information system that will support strategic information needs.

Conclusion

Understanding how to use a data modeling tool like ERwin is only part of the problem. You must also understand when data modeling tasks are being solved and how the information requirements and business rules that should be represented in the data model are collected. Work sessions that bring together domain experts, users, and information technology professionals provide the most conducive environment for collecting information requirements.

Building a good data model requires analyzing and researching the information requirements and business rules collected through work sessions and interviews. The resulting data model should be compared with the enterprise model, if possible, to ensure that it does not conflict with existing object models and includes all the required objects.

The data model consists of logical and physical models that represent information requirements and business rules. The logical model should be reduced to third normal form. Third normal form eliminates insert, update, and delete anomalies in data structures and supports the "one fact in one place" principle. The collected information requirements and business rules should be analyzed, researched, and compared with the enterprise model to ensure that they do not conflict with existing object models and that all the required objects are included.
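
As a hedged illustration of reduction to third normal form (the tables and columns are hypothetical): in the first table the department name depends on the department code, which in turn depends on the key, a transitive dependency; splitting the table puts each fact in one place.

    -- Not in 3NF: department_name depends on department_code,
    -- which depends on the key employee_id (a transitive dependency)
    CREATE TABLE employee_flat (
        employee_id     INTEGER      NOT NULL PRIMARY KEY,
        employee_name   VARCHAR(100) NOT NULL,
        department_code CHAR(4)      NOT NULL,
        department_name VARCHAR(100) NOT NULL
    );

    -- In 3NF: "one fact in one place"
    CREATE TABLE department (
        department_code CHAR(4)      NOT NULL PRIMARY KEY,
        department_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE employee (
        employee_id     INTEGER      NOT NULL PRIMARY KEY,
        employee_name   VARCHAR(100) NOT NULL,
        department_code CHAR(4)      NOT NULL
            REFERENCES department (department_code)
    );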

In ERwin, the data model includes both logical and physical models. ERwin implements the ER approach and allows you to create logical and physical model objects to represent information requirements and business rules. Logical model objects include entities, attributes, and relationships. Physical model objects include tables, columns, and referential integrity constraints.

One of the following publications will cover the issues of identifying entities, defining entity types, choosing entity names and descriptions, as well as some techniques to avoid the most common modeling errors associated with the use of entities.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its meaning, a logical data type, and a clear, short, complete description or definition. In a future blog post, we will look at an initial set of guidelines for properly formatting attribute names and descriptions. Relationships should include a verb construction that describes the relationship between entities, along with characteristics such as cardinality and the necessity of existence or possibility of absence of a relationship.

NOTE The cardinality of a relationship describes the maximum number of instances of the secondary entity that can be associated with an instance of the original entity. The necessity of existence or possibility of absence of a relationship determines the minimum number of instances of the secondary entity that can be associated with an instance of the original entity.

Increasingly, IT professionals are turning their attention to data management solutions based on industry-standard data models and business solution templates. Ready-to-download complex physical data models and business intelligence reports for specific areas of activity allow you to unify the information component of an enterprise and significantly speed up business processes. Solution templates enable service providers to leverage non-standard information hidden in existing systems, thereby reducing project lead times, costs and risks. For example, real-world projects show that data model and business solution templates can reduce development effort by 50%.

An industry logical model is a domain-specific, integrated, and logically structured view of all the information that must reside in an enterprise data warehouse to answer both strategic and tactical business questions. The main purpose of such models is to facilitate orientation in the data space and to help highlight the details that are important for business development. In modern conditions, a successful business requires a clear understanding of the links between the various components and a good idea of the overall picture of the organization. Identifying all details and relationships using models allows the most efficient use of the time and tools for organizing the work of the company.

Data models are abstract models that describe how data is presented and accessed. Data models define data items and the relationships between them in a particular area. A data model is a navigation tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real-world information. This allows for better communication within the organization and thus creates a more flexible and stable application environment.


An example of a “GIS for government and local government” model.

Today, it is strategically important for software and service providers to be able to quickly respond to changes in the industry associated with technological innovations, the removal of government restrictions and the complication of supply chains. Along with the changes in the business model, the complexity and cost of the information technology required to support a company's operations increases. Data management is especially difficult in an environment where corporate information systems, as well as functional and business requirements for them, are constantly changing.

Industry data models are intended to facilitate and optimize this process and to bring the IT approach up to the modern level.

Industry data models from Esri

Esri ArcGIS data models are working templates for use in GIS projects and for creating data structures for different application areas. Data model building involves creating a conceptual design, logical and physical structure, which can then be used to build a personal or corporate geodatabase. ArcGIS provides tools for creating and managing a database schema, and Data Model Templates are used to quickly launch a GIS project across a variety of applications and industries. Esri has spent a significant amount of time with the user community to develop a range of templates that can provide a quick start to the design of an enterprise geodatabase. These projects are described and documented at support.esri.com/datamodels. Below, in the order in which they appear on this site, is a semantic translation of Esri's industry model names:

  • Address register
  • Agriculture
  • Meteorology
  • Basic spatial data
  • Biodiversity
  • Internal space of buildings
  • Greenhouse gas accounting
  • Maintaining administrative boundaries
  • Military establishment. Intelligence service
  • Energy (including the new ArcGIS MultiSpeak protocol)
  • Ecological structures
  • Ministry of Emergency Situations. Fire brigade
  • Forest inventory
  • Forestry
  • Geology
  • National level GIS (e-gov)
  • Groundwater and waste water
  • Health care
  • Archeology and conservation of memorial sites
  • National security
  • Hydrology
  • International Hydrographic Organization (IHO). S-57 format for ENC
  • Irrigation
  • Land Registry
  • Municipal government
  • Nautical navigation
  • State cadastre
  • Oil and gas structures
  • Pipelines
  • Raster storage
  • Bathymetry, seabed relief
  • Telecommunications
  • Transport
  • Water supply, sewerage, housing and communal services

These models contain all the necessary features of the industry standard, namely:

  • are freely available;
  • are not tied to the technology of the “chosen” manufacturer;
  • created as a result of the implementation of real projects;
  • created with the participation of industry experts;
  • designed to provide information interaction between various products and technologies;
  • do not contradict other standards and regulations;
  • used in completed projects around the world;
  • are designed to work with information throughout the entire life cycle of the system being created, and not the project itself;
  • expandable according to the needs of the customer without losing compatibility with other projects and / or models;
  • accompanied by additional materials and examples;
  • used in guidelines and technical materials of various industrial companies;
  • a large community of participants, while access to the community is open to everyone;
  • a large number of references to data models in publications in recent years.

Esri is part of an expert group of independent bodies that recommend various industry models, such as PODS (Pipeline Open Data Standards, an open standard for the oil and gas industry; PODS is currently implemented as the PODS Esri Spatial 5.1.1 geodatabase) or the geodatabase (GDB) from ArcGIS for Aviation, which takes into account ICAO and FAA recommendations as well as the AIXM 5.0 navigation data exchange standard. In addition, there are recommended models that strictly adhere to existing industry standards, such as S-57 and ArcGIS for Maritime (marine and coastal features), as well as models created from work performed by Esri Professional Services that have become de facto standards in the corresponding areas. For example, GIS for the Nation and Local Government influenced the NSDI and INSPIRE standards, and Hydro and Groundwater (hydrology and groundwater) are heavily used in the freely available professional ArcHydro suite and in commercial third-party products. It should be noted that Esri also supports de facto standards such as NHDI. All proposed data models are documented and ready for use in enterprise IT processes. Accompanying materials for the models include:

  • UML entity-relationship diagrams;
  • data structures, domains, directories;
  • ready-made geodatabase templates in ArcGIS GDB format;
  • sample data and sample applications;
  • examples of data loading scripts, examples of analysis utilities;
  • reference books on the proposed data structure.

Esri brings together its experience in building industry models in the form of books and localizes published material. The following books have been localized and published by Esri CIS:

  • Geospatial Service Oriented Architecture (SOA);
  • Designing geodatabases for transport;
  • Corporate geographic information systems;
  • GIS: new energy for electrical and gas enterprises;
  • Oil and gas on a digital map;
  • Modeling our world. Esri Geodatabase Design Guide;
  • Thinking about GIS. GIS Planning: A Manual for Managers;
  • Geographic information systems. Basics;
  • GIS for administrative and economic management;
  • Web GIS. Principles and Applications;
  • Systems Design Strategies, 26th Edition;
  • 68 issues of ArcReview magazine with publications of companies and users of GIS systems;
  • ... and many other thematic notes and publications.

For example, the book "Modeling our world..." (in translation) is a comprehensive guide and reference for GIS data modeling in general, and for the geodatabase data model in particular. The book shows how to make the right data modeling decisions - decisions that affect every aspect of a GIS project, from database design and data collection to spatial analysis and visualization. It describes in detail how to design a geographic database appropriate for the project, configure database functionality without programming, manage workflow in complex projects, model various network structures such as river, transport or electrical networks, integrate satellite imagery into geographic analysis and display, and create 3D models of GIS data. The book "Designing geodatabases for transportation" contains methodological approaches that have been tested on a large number of projects and fully comply with the legislative requirements of Europe and the United States, as well as international standards. And the book "GIS: New Energy for Electric and Gas Plants" uses real-world examples to show the benefits that a corporate GIS can bring to an energy supplier, including aspects such as customer service, network operations and other business processes.


Some of the books, translated and original, published in Russian by Esri CIS and DATA +. They address both conceptual issues related to GIS technology and many applied aspects of modeling and deploying GIS of various sizes and purposes.

We will consider the application of industry models using the example of BISDM (Building Interior Space Data Model) version 3.0. BISDM is a development of the more general BIM (Building Information Model) concept and is intended for use in the design, construction, operation and decommissioning of buildings and structures. Used in GIS software, it allows you to efficiently exchange geodata with other platforms and interact with them. It belongs to the general group of FM tasks (organization infrastructure management). Let's list the main advantages of the BISDM model, the use of which allows you to:

  • organize the exchange of information in a heterogeneous environment according to uniform rules;
  • get a "physical" embodiment of the BIM concept and recommended rules for construction project management;
  • to maintain by means of GIS a single repository throughout the entire life cycle of a building (from design to decommissioning);
  • coordinate the work of various specialists in the project;
  • visualize the planned schedule and construction stages for all participants;
  • give a preliminary estimate of the cost and construction time (4D and 5D data);
  • monitor the progress of the project;
  • ensure high-quality operation of the building, including maintenance and repairs;
  • become a part of the asset management system, including the functions of analyzing the efficiency of the use of space (leasing, warehouse, employee management);
  • calculate and manage energy efficiency objectives of the building;
  • simulate the movement of human flows.

BISDM defines the rules for working with spatial data at the level of the internal premises of a building, including the purpose and types of use, laid communications, installed equipment, accounting for repairs and maintenance, logging of incidents, and relationships with other company assets. The model helps to create a unified repository of geographic and non-geographic data. The experience of the world's leading companies was used to identify the entities and to model, at the geodatabase level, the spatial and logical relationships of all the physical elements that form both the building itself and its internal premises. Following the principles of BISDM can significantly simplify integration with other systems. The first stage is usually CAD integration. Then, during the operation of the building, data exchange with ERP and EAM systems (SAP, TRIRIGA, Maximo, etc.) is used.


Visualization of BISDM structural elements using ArcGIS.

When BISDM is used, the customer / owner of the facility receives an end-to-end exchange of information, from the idea of creating the object to the development of a complete project, control of construction with up-to-date information available by the time the facility is put into operation, control of parameters during operation, and even during reconstruction or decommissioning of the facility. Following the BISDM paradigm, the GIS and the geodatabase created with its help become a common data repository for related systems. Often the GDB contains data created and operated by third-party systems; this must be taken into account when designing the architecture of the system being created.

At a certain stage, the accumulated "critical mass" of information allows you to move to a new quality level. For example, upon completion of the design phase of a new building, it is possible to automatically visualize 3D survey models in GIS, compile a list of installed equipment, calculate the mileage of utilities to be laid, perform a number of checks and even give a preliminary financial estimate of the project cost.

Once again, we note that when BISDM and ArcGIS are used together, it becomes possible to build 3D models automatically from the accumulated data, since the geodatabase contains a complete description of the object, including z-coordinates, floor membership, types of element connections, equipment installation methods, materials, available personnel movement paths, the functional purpose of each element, and so on. It should be noted that after the initial import of all design materials into the BISDM GDB, there is a need for additional information content for:

  • placement of 3D models of objects and equipment in designated places;
  • collecting information on the cost of materials and the procedure for their laying and installation;
  • control of permeability according to the dimensions of the installed non-standard equipment.

Due to the use of ArcGIS, it is easier to import additional 3D objects and reference data from external sources, because ArcGIS Data Interoperability allows you to create procedures for importing such data and placing it correctly within the model. All formats used in the industry are supported, including IFC, AutoCAD Revit, and Bentley MicroStation.

Industry data models from IBM

IBM provides a set of storage management tools and models for a variety of business areas:

  • IBM Banking and Financial Markets Data Warehouse (finance)
  • IBM Banking Data Warehouse
  • IBM Banking Process and Service Models
  • IBM Health Plan Data Model (healthcare)
  • IBM Insurance Information Warehouse (insurance)
  • IBM Insurance Process and Service Models
  • IBM Retail Data Warehouse (retail)
  • IBM Telecommunications Data Warehouse (telecommunications)
  • InfoSphere Warehouse Pack:
    - for Customer Insight (for understanding customers)
    - for Market and Campaign Insight (for understanding the company and the market)
    - for Supply Chain Insight (for understanding suppliers).

For example, the IBM Banking and Financial Markets Data Warehouse model is designed to address the specific problems of the banking industry in terms of data, and IBM Banking Process and Service Models in terms of processes and SOA (service-oriented architecture). For the telecommunications industry, the IBM Information FrameWork (IFW) and IBM Telecommunications Data Warehouse (TDW) models are offered. They help to significantly speed up the creation of analytical systems, as well as reduce the risks associated with the development of business intelligence applications, corporate data management and the organization of data warehouses, taking into account the specifics of the telecommunications industry. The capabilities of IBM TDW cover the entire spectrum of the telecommunications market - from Internet providers and cable network operators offering wire and wireless telephony, data transmission and multimedia content, to multinational companies providing telephone, satellite, long distance and international communication services, as well as operators of global networks. Today TDW is used by large and small wire and wireless service providers around the world.

The InfoSphere Warehouse Pack for Customer Insight tool provides structured and easily deployable business content for a growing number of business projects and industries, including banking, insurance, finance, health insurance, telecommunications, retail and distribution. For business users, InfoSphere Warehouse Pack for Market and Campaign Insight helps to maximize the efficiency of market analysis activities and marketing campaigns through a step-by-step development process that takes into account the specifics of the business. Using InfoSphere Warehouse Pack for Supply Chain Insight, organizations can receive current information on supply chain operations.


Esri's position within the IBM solution architecture.

Particularly noteworthy is the IBM approach to utilities. To meet growing consumer demands, utilities need a more flexible architecture than those in use today, as well as an industry-standard object model to facilitate the free flow of information. This will increase the communication capabilities of utilities by enabling more cost-effective interoperability, and will give new systems better visibility of all the resources they need, no matter where those resources are located within the organization. The basis for this approach is SOA (Service Oriented Architecture), a component model that relates the functions of departments to the services of various applications, which can be reused. The "services" of such components exchange data through interfaces without rigid coupling, hiding from the user all the complexity of the systems behind them. In this mode, enterprises can easily add new applications regardless of the software vendor, operating system, programming language, or other intrinsic characteristics of the software. The SAFE (Solution Architecture for Energy) concept is implemented on the basis of SOA; it allows a utility company to obtain a standards-based, holistic view of its infrastructure.

Esri ArcGIS® is a globally recognized software platform for geographic information systems (GIS), which provides the creation and management of digital assets of electric power, gas transmission, distribution, and telecommunication networks. ArcGIS allows you to carry out the most complete inventory of the components of the electrical distribution network, taking into account their spatial location. ArcGIS dramatically extends the IBM SAFE architecture by providing the tools, applications, workflows, analytics, and information integration capabilities needed to manage a smart energy enterprise. ArcGIS within the framework of IBM SAFE allows you to receive information from various sources about infrastructure facilities, assets, customers and employees with accurate data about their location, as well as create, store and process georeferenced information about enterprise assets (supports, pipelines, wires, transformers, cable ducts etc.). ArcGIS within the SAFE infrastructure dynamically connects core business applications by combining data from GIS, SCADA, and customer service systems with external information such as traffic intensity, weather conditions, or satellite imagery. Utilities use this combined information for a variety of purposes, from S.O.R. (the overall picture of the operational environment) to site inspection, maintenance, network analysis and planning.

The information components of a utility company can be modeled using several levels that range from the lowest, physical, to the highest, most complex level of business logic. These layers can be integrated to meet typical industry requirements such as automated measurement logging and supervisory control and data acquisition (SCADA) management. By building the SAFE architecture, utilities are making significant strides in promoting an industry-wide open object model called the Common Information Model (CIM) for Energy and Utilities. This model provides the necessary framework for moving many enterprises towards a service-oriented architecture as it encourages the use of open standards for structuring data and objects. Due to the fact that all systems use the same objects, the confusion and inelasticity associated with different implementations of the same objects will be reduced to a minimum. Thus, the definition of the client object and other important business objects will be unified across all systems of the power supply company. Now, with CIM, service providers and service consumers can share a common data structure, making it easier to outsource high-value business components as CIM establishes a common base on which to build information exchange.

Conclusion

Comprehensive industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of organizations surveyed found integration to be a significant barrier to the adoption of new applications. In contrast, implementing data integration brings the company tangible returns and increased efficiency.

A well-constructed model uniquely identifies the meaning of the data, which in this case is structured data (as opposed to unstructured data such as an image, binary file, or text, where the meaning can be ambiguous). Most effective are the industry models offered by professional vendors such as Esri and IBM. The high return on using their models is achieved due to the significant level of detail and accuracy. They usually contain many data attributes. In addition, both Esri and IBM have extensive modeling experience and are well versed in building industry-specific models.


DB architecture

The KMD schema is a description of the structure of the data model from the administrator's point of view.

The AMD schema is a description of the internal, or physical, model. It stores the description of the physical placement of data on the media, including direct indications of the location of data in memory (volumes, disks).

The KMD schema describes the structure of the data: records and fields.

All DBMSs support three main types of data models:

1. The hierarchical model. It assumes a root record from which branches descend.

Not all objects are conveniently described in this way: there are no cross-links in the hierarchy, and great redundancy of information is characteristic.

2. The network model. It allows you to correctly display all the complexity of relationships.

The model is convenient for representing links with data from the external environment, but less convenient for description in a database, which forces the user to do additional work studying navigation through the links.

3. The relational model. It is based on the mathematical term relation, which for practical purposes is simply a table - rectangular and two-dimensional.

The relational data structure was developed in the late 1960s by a number of researchers, of whom the most significant contribution was made by IBM employee Edgar Codd. With the relational approach, data is presented in the form of two-dimensional tables - the most natural for humans. At the same time, for data processing, Codd suggested using the apparatus of set theory - union, intersection, difference, Cartesian product.

The data type concept has the same meaning here as in programming languages: the data type determines the internal representation in computer memory and the way a data instance is stored, as well as the set of values the data instance can take and the set of valid operations on it. All modern databases support special data types designed to store integers, fractional floating-point numbers, characters and strings, and calendar dates. Many database servers implement other types as well; for example, the Interbase server has a special data type for storing large arrays of binary information (BLOB).

A domain is a potential set of values of a simple data type; it resembles a data subtype in some programming languages. A domain is defined by two elements - a data type and a boolean expression that is applied to the data. If this expression evaluates to true, then the data instance belongs to the domain.
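
A minimal sketch, assuming a DBMS that supports the SQL CREATE DOMAIN statement (PostgreSQL and Firebird / Interbase do): the domain combines a data type with a boolean condition that every value must satisfy.

    -- A domain: the integer data type plus the condition n > 0
    CREATE DOMAIN positive_quantity AS INTEGER
        CHECK (VALUE > 0);

    -- A column defined on this domain accepts only values
    -- for which the condition evaluates to true
    CREATE TABLE order_line (
        order_id INTEGER           NOT NULL,
        item_id  INTEGER           NOT NULL,
        quantity positive_quantity NOT NULL
    );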

A relation is a two-dimensional table of a special kind, consisting of a heading and a body.

A heading is a fixed set of attributes, each of which is defined on some domain, with a one-to-one correspondence between the attributes and their defining domains.


Each attribute is defined on its own domain; for example, an attribute may be defined on a domain whose data type is integer and whose boolean condition is n > 0. The heading is unchanged over time, in contrast to the body of the relation. The body of the relation is a collection of tuples, each of which is a set of attribute-value pairs.

The cardinality of a relation is the number of its tuples, and the degree of a relation is the number of its attributes.

The degree is constant for a given relation, while the cardinality changes over time. The cardinality of a relation is also called its cardinal number.

The above concepts are theoretical and are used in the development of language tools and software systems for relational DBMSs. In everyday work, their informal equivalents are used instead:

relation - table;

attribute - column or field;

tuple - record or row.

Thus, the degree of a relation is the number of columns in the table, and the cardinal number is the number of rows.

Since a relation is a set, and in classical set theory a set by definition cannot contain coinciding elements, a relation cannot have two identical tuples. Therefore, for a given relation there is always a set of attributes that uniquely identifies a tuple. This set of attributes is called a key.

The key must meet the following requirements:

· Must be unique;

· Must be minimal, that is, removing any attribute from the key leads to a violation of uniqueness.

As a rule, the number of attributes in the key is less than the degree of the relation; however, in the extreme case, the key may contain all the attributes, since the combination of all attributes satisfies the uniqueness condition. Typically, a relation has several keys. Of all the keys of the relation (also called possible, or candidate, keys), one is chosen as the primary key. When choosing the primary key, the key with the fewest attributes is usually preferred. It is also impractical to use keys with long string values.

In practice, a special numeric attribute is often used as the primary key - an auto-incremented field whose value can be generated by a trigger (a trigger is a special procedure that is called when changes are made to the database) or by special means defined in the DBMS engine.
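
A hedged sketch of such a surrogate primary key, using the identity-column syntax of the SQL standard (supported, for example, by PostgreSQL, Oracle and DB2); other DBMSs achieve the same with AUTO_INCREMENT, SERIAL, or a sequence plus a trigger.

    -- A surrogate primary key whose values are generated by the DBMS
    CREATE TABLE product (
        product_id   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL
    );

    -- The key value is not supplied on insert; the DBMS generates it
    INSERT INTO product (product_name) VALUES ('Widget');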

The basic concepts described in this chapter are not specific to any particular database implementation, but are common to all of them. Thus, these concepts are the basis of a certain general model, which is called the relational data model.

The founder of the relational approach, Date established that the relational model has three parts:

· Structural;

· Manipulative;

· Holistic.

In the structural part of the model, relations are fixed as the only data structure used in the relational model.

In the manipulation part, two basic mechanisms for manipulating relational bases are fixed - relational algebra and relational calculus.

An integral part is understood as a certain mechanism to ensure the non-destructibility of data. The integral part encompasses two basic requirements for the integrity of relational databases - entity integrity and referential integrity.

The entity integrity requirement is that any tuple of any relation must be distinguishable from any other tuple of that relation; in other words, any relation must have a primary key. This requirement is met if the basic properties of relations hold.

The data manipulation language and the query language rely on a mathematical apparatus called relational algebra, for which the following operations are defined:

1. Standard set operations: intersection, union, difference, Cartesian product.

2. Specific operations: projection, selection (restriction), join, division.

a. Union.

Two relations with the same attributes are given: R1 (part code ШД, material code ШМ, unit of measure ЕИ, consumption rate НР) and R2 (ШД, ШМ, ЕИ, НР). The union R1 ∪ R2 is required. In this operation the degree is preserved, and the cardinality of the result does not exceed the sum of the cardinalities of R1 and R2.

b. Intersection.

Selects the matching tuples of R1 and R2.

c. Difference.

Eliminates from R1 the tuples that coincide with tuples of R2.

d. Cartesian product.

Here tuples are concatenated: each row of one set is concatenated with each row of the other. The degree of the result is the sum of the degrees of the two relations, and its cardinality is the product of their cardinalities; for example, relations with 3 and 4 tuples of degrees 2 and 3 yield 12 rows and 5 columns.
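
These operations have direct counterparts in SQL; the sketch below uses hypothetical relations r1 and r2 of identical structure (and s1, s2 for the product).

    -- Union: the tuples of r1 and r2, without duplicates
    SELECT * FROM r1
    UNION
    SELECT * FROM r2;

    -- Intersection: the tuples present in both relations
    SELECT * FROM r1
    INTERSECT
    SELECT * FROM r2;

    -- Difference: the tuples of r1 that do not occur in r2
    SELECT * FROM r1
    EXCEPT
    SELECT * FROM r2;

    -- Cartesian product: every row of s1 concatenated with every row of s2
    SELECT * FROM s1 CROSS JOIN s2;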


TOPIC V. CORPORATE DATABASES

V.1. Organization of data in corporate systems. Corporate databases.

V.2. DBMS and structural solutions in corporate systems.

V.3. Internet / Intranet technologies and corporate solutions for database access.

V.1. ORGANIZATION OF DATA IN CORPORATE SYSTEMS. CORPORATE DATABASES

A corporate database is the central link of a corporate information system and makes it possible to create a single information space for the corporation (Figure 1.1).

There are various definitions of databases.

A database (DB) is understood as a set of logically connected information that makes up a single set of data stored in the memory devices of a computer. This set serves as the initial data for the tasks solved in the operation of automated control systems, data processing systems, and information and computing systems.

The term database can be summarized as a collection of logically related data intended for sharing.

A database is also understood as a set of data stored together with such minimal redundancy that it can be used optimally for one or more applications.

The purpose of creating a database as a form of data storage is to build a data system that does not depend on the algorithms (software) adopted, the technical means used, or the physical location of the data in the computer. The database assumes multipurpose use (several users, many forms of documents and requests of a single user).

Basic requirements for databases:

  • Completeness of data presentation. The data in the database should adequately represent all the information about the object and should be sufficient for the ODS.
  • Database integrity. The data must be preserved during processing by the ODS and in any situations that arise in the course of work.
  • Flexibility of the data structure. The database should allow changing the data structures without violating its integrity and completeness when external conditions change.
  • Feasibility. There must be an objective representation of the various objects, their properties and relationships.
  • Availability. Delimitation of access to data must be provided.
  • Minimal redundancy. The database should have minimal redundancy in the representation of data about any object.

Knowledge is a set of facts, patterns and heuristic rules that can be used to solve a problem.

A knowledge base (KB) is a set of databases and rules obtained from decision-makers. The knowledge base is an element of expert systems.

There are different ways of presenting data.

Physical data is data stored in computer memory.

A logical data representation corresponds to the user's view of the physical data. The difference between the physical and the corresponding logical representation of data is that the latter reflects some important relationships between the physical data.

A corporate database is understood as a database that unites, in one form or another, all the necessary data and knowledge about the organization being automated. In corporate information systems, the concept of integrated databases has become widespread, in which the principle of single entry and repeated use of information is implemented.

Fig. 1.1. The structure of the interaction of departments with the information resources of the corporation.

Corporate databases are either concentrated (centralized) or distributed.

A concentrated (centralized) database is a database whose data is physically stored in the storage devices of one computer. Figure 1.2 presents a diagram of a server application for accessing databases on various platforms.

Figure 1.2. Scheme of a heterogeneous centralized database.

Centralization of information processing made it possible to eliminate such disadvantages of traditional file systems as incoherence, inconsistency and redundancy of data. However, as databases grow, and especially when used in geographically dispersed organizations, problems arise. For example, for concentrated databases located at a node of a telecommunications network, with the help of which various departments of an organization gain access to data, the following difficulties arise with an increase in the volume of information and the number of transactions:

  • Large flow of data exchange;
  • High traffic on the network;
  • Low reliability;
  • Poor overall performance.

Although it is easier to ensure the security, integrity, and consistency of information during updates in a concentrated database, the problems listed above create certain difficulties. Data decentralization is proposed as a possible solution. Decentralization achieves:

  • Higher degree of simultaneity of processing due to load balancing;
  • Improved use of data in the field when performing remote queries;
  • Lower costs;
  • Ease of managing local databases.

The costs of creating a network in the nodes of which workstations (small computers) are located are much lower than the costs of creating a similar system using a large computer. Figure 1.3 shows the logical diagram of a distributed database.

Figure 1.3. Distributed corporation database.

Let's give the following definition of a distributed database.

A distributed database is a collection of information - files (relations) - stored in different nodes of an information network and logically connected so as to make up a single set of data (the connection can be functional or through copies of the same file). Thus, it is a set of databases that are logically interconnected but physically located on several machines that are part of the same computer network.

The most important performance requirements for a distributed database are:

  • Scalability;
  • Compatibility;
  • Support for various data models;
  • Portability;
  • Location transparency;
  • Autonomy of nodes of a distributed database (Site Autonomy);
  • Distributed request processing;
  • Execution of distributed transactions.
  • Support for a homogeneous security system.

Location transparency allows users to interact with databases without knowing anything about their location. The autonomy of distributed database nodes means that each database can be maintained independently of the others. A distributed query is a query (SQL statement) during whose execution objects (tables or views) of different databases are accessed. When distributed transactions are executed, concurrency control of all the databases involved is performed. Oracle7 uses two-phase commit technology to perform distributed transactions.
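
A hedged sketch in Oracle-style syntax (the database link name sales_link and all table names are hypothetical): objects of a remote database are addressed through a database link, and COMMIT triggers the two-phase commit protocol across the participating databases.

    -- A distributed query: the local table "orders" is joined with
    -- the remote table "customers" reached through a database link
    SELECT o.order_id, c.customer_name
    FROM   orders o
    JOIN   customers@sales_link c ON c.customer_id = o.customer_id;

    -- A distributed transaction: changes in two databases are
    -- committed atomically by the two-phase commit protocol
    UPDATE orders               SET status  = 'SHIPPED' WHERE order_id = 1;
    UPDATE shipments@sales_link SET shipped = 'Y'       WHERE order_id = 1;
    COMMIT;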

The databases that make up a distributed database do not have to be homogeneous (i.e., be maintained by one DBMS) or processed in the environment of the same operating system and / or on computers of the same type. For example, one database can be an Oracle database on a SUN machine running SUN OS (UNIX), a second database can be hosted by a DB2 database on an IBM 3090 mainframe with an MVS operating system, and a third database can be maintained by SQL / DS also on the IBM mainframe, but with the VM operating system. Only one condition is required - all machines with databases must be accessible over the network they are part of.

The main task of a distributed database is to distribute data over the network and provide access to it. This problem can be solved in the following ways:

  • Each node stores and uses its own data set, which is available for remote queries. This distribution is called partitioned (divided).
  • Some data frequently used at remote sites may be duplicated. This distribution is called partially duplicated.
  • All data is duplicated at each node. This distribution is called fully duplicated.
  • Some files can be split horizontally (a subset of records is selected) or vertically (a subset of attribute fields is selected), while the selected subsets are stored in different nodes along with the unsplit data. This distribution is called split (fragmented); a sketch follows this list.
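
A hedged sketch of the last strategy with hypothetical tables (the CREATE TABLE ... AS SELECT form is supported, for example, by Oracle and PostgreSQL): horizontal fragmentation selects a subset of records for a node, while vertical fragmentation selects a subset of attribute fields together with the key.

    -- A horizontal fragment: only the records of one region are kept at the node
    CREATE TABLE customer_north AS
    SELECT * FROM customer WHERE region_code = 'NORTH';

    -- A vertical fragment: only some attributes, together with the key
    CREATE TABLE customer_contacts AS
    SELECT customer_id, phone, email FROM customer;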

When creating a distributed database, at the conceptual level, you have to solve the following tasks:

  • It is necessary to have a single conceptual diagram of the entire network. This will provide logical transparency of data for the user, as a result of which he will be able to form a request to the entire database, being behind a separate terminal (it seems to work with a centralized database).
  • A schema is needed to locate the data on the network. This will provide transparency of data placement, thanks to which the user does not have to specify where to send the request to get the required data.
  • It is necessary to solve the problem of heterogeneity of distributed databases. Distributed databases can be homogeneous or heterogeneous in terms of hardware and software. The problem of heterogeneity is relatively easy to solve if the distributed database is heterogeneous in the sense of hardware, but homogeneous in the sense of software (the same DBMS in the nodes). If different DBMSs are used in the nodes of a distributed system, means of transforming data structures and languages ​​are required. This should provide transparency of transformation across the nodes of the distributed database.
  • It is necessary to solve the problem of dictionary management. To provide all kinds of transparency in a distributed database, you need programs that manage numerous dictionaries and reference books.
  • You need to define methods for executing queries in a distributed database. The methods for executing queries in a distributed database differ from those in centralized databases, since individual parts of the queries need to be executed at the location of the relevant data and the partial results must be passed to other nodes; at the same time, coordination of all processes must be ensured.
  • It is necessary to solve the problem of parallel query execution. A distributed database requires a sophisticated concurrency control mechanism, which, in particular, must ensure synchronization when information is updated, which ensures data consistency.
  • A well-developed methodology for the distribution and placement of data, including splitting, is required; this is one of the main requirements for a distributed database.

One of the actively developing new areas of computing system architecture, and a powerful tool for non-numerical information processing, is database machines. Database machines are used to solve non-numeric tasks such as storing, searching and transforming documents and facts, and working with objects. Following the definition of data as digital and graphic information about objects of the surrounding world, different content is put into the concept of data in numerical and non-numerical processing. Numerical processing uses objects such as variables, vectors, matrices, multidimensional arrays, constants, and so on, while non-numeric processing uses objects such as files, records, fields, hierarchies, networks, relations, and so on. Non-numeric processing is interested directly in information about objects (for example, a specific employee or a group of employees), not in the employee file as such. The employee file is not indexed here to select a specific person; rather, the content of the desired record is of interest. Large amounts of information are usually subjected to non-numerical processing. In various applications, the following operations, for example, can be performed on such data:

  • increase the salary of all employees of the company;
  • calculate the bank interest on the accounts of all clients;
  • make changes to the list of all goods in stock;
  • find the required abstract from all texts stored in the library or in the bibliographic information retrieval system;
  • find a description of the required contract in a file containing legal documents;
  • look through all files containing descriptions of patents and find a patent (if any) similar to the one proposed.

Database machines use parallel and associative architectures as an alternative to the uniprocessor von Neumann structure, allowing them to work with large amounts of information in real time.

Database machines are gaining importance in connection with the research and application of artificial intelligence concepts such as knowledge representation, expert systems, inference, pattern recognition, etc.

Information storages. Today many people admit that most companies already operate several databases and that successful work with information requires not only different types of databases but different generations of DBMSs. According to statistics, each organization uses an average of 2.5 different DBMSs. The need became obvious to "isolate" the business of companies, or rather the people involved in this business, from the technological features of the databases, and to provide users with a single view of corporate information regardless of where it is physically stored. This stimulated the emergence of data warehousing technology (Data Warehousing, DW).

The main goal of DW is to create a single logical representation of the data contained in different types of databases, or, in other words, a single corporate data model.

A new round of DW development became possible due to the improvement of information technology in general, in particular the emergence of new types of databases based on parallel query processing, which in turn relied on advances in the field of parallel computers. Query builders with an intuitive graphical interface were created, which made it easy to build complex queries to the database. Various middleware software provided connectivity between heterogeneous databases, and, finally, the cost of storage devices fell sharply.

A data bank may be present in the structure of a corporation.

A data bank is a functional and organizational component in automated control systems and information and computing systems, providing centralized information support for a team of users or for a set of tasks solved in the system.

A data bank is considered an information and reference system whose main purpose is:

  • in the accumulation and maintenance in working order of a set of information that constitutes the information base of the entire automated system or a certain set of tasks solved in it;
  • in the issuance of the data required by the task or user;
  • in providing collective access to stored information;
  • in ensuring the necessary management of the use of information contained in the information base.

Thus, a modern data bank is a complex software and hardware system, which includes technical, system and network tools, databases and DBMSs, and information retrieval systems for various purposes.

V.2. DBMS AND STRUCTURAL SOLUTIONS IN CORPORATE SYSTEMS

Database and knowledge management systems

An important component of modern information systems are database management systems (DBMS).

A DBMS is a set of software and language tools intended for the creation, maintenance and use of databases.

The database management system provides data processing systems with access to databases. As already noted, DBMSs play an important role in the creation of corporate information systems, and an especially important role in the creation of information systems that use distributed information resources based on modern network computer technologies.

The main feature of modern DBMSs is their support for technologies such as:

  • Client / server technology.
  • Support for database languages: a schema definition language (SDL - Schema Definition Language), a data manipulation language (DML), and integrated languages such as SQL (Structured Query Language), QBE (Query-By-Example) and QMF (Query Management Facility), an advanced query specification and reporting tool for DB2, etc.;
  • Direct data management in external memory.
  • Management of RAM buffers.
  • Transaction management: OLTP technology (On-Line Transaction Processing) and OLAP technology (On-Line Analytical Processing) for DW.
  • Ensure data protection and integrity. The use of the system is allowed only to users who have the right to access the data. When users perform operations on data, the consistency of the stored data (integrity) is maintained. This is important in corporate multi-user information systems.
  • Journalization.

Modern DBMS must ensure compliance with the database requirements listed above. In addition, they must comply with the following principles:

  • Data independence.
  • Versatility. The DBMS must have powerful conceptual data model support for displaying custom logical views.
  • Compatibility. The DBMS must remain operational with the development of software and hardware.
  • Minimal data redundancy. Unlike file systems, a database must be a single collection of integrated data.
  • Data protection. The DBMS must provide protection against unauthorized access.
  • Data integrity. The DBMS must prevent users from breaking the database.
  • Management of simultaneous work. The DBMS must protect the database from inconsistencies in the shared access mode. To ensure a consistent state of the database, all user requests (transactions) must be executed in a specific order.
  • The DBMS must be universal. It should support different data models on a single logical and physical basis.
  • The DBMS must support both centralized and distributed databases and, thus, become an important link in computer networks.

Considering a DBMS as a class of software products focused on maintaining databases in automated systems, we can distinguish two most essential features that determine the types of DBMS. According to them, a DBMS can be viewed from two points of view:

  • their capabilities in relation to distributed (corporate) databases;
  • their relationship to the type of the data model implemented in the DBMS.

In relation to corporate (distributed) databases, the following types of DBMS can be conventionally distinguished:

  • "Desktop" DBMS. These products are primarily focused on working with personal data ("desktop" data). They have command sets for sharing common databases, but small in size (like a small office). First of all, it is a DBMS such as Assess, dBASE, Paradox, EohPgo. Why Assess, dBASE, Paradox, EohPgo have poor access to corporate data. The point is, there is no easy way to overcome the barrier between personal and corporate data. And the point is not even that the mechanism of the personal data DBMS (or small office) is focused on accessing data through many gateways, internetworking products, etc. The problem is that these mechanisms are usually associated with full file transfers and the lack of forked index support, which results in server queues practically stalling on large systems.
  • Specialized high-performance multi-user DBMS. Such DBMSs are characterized by the presence of a multiuser system kernel, a data manipulation language and the following functions typical for developed multiuser DBMSs:
  • organization of the buffer pool;
  • the presence of a system for processing queues of transactions;
  • the presence of mechanisms for multi-user data locking;
  • transaction logging;
  • the availability of access control mechanisms.

DBMSs such as Oracle, DB2, SQL Server, Informix, Sybase, ADABAS, Titanium and others provide a wide range of services for processing corporate databases.

When working with databases, the transaction mechanism is used.

A transaction is a logical unit of work.

A transaction is a sequence of data manipulation statements that is executed as a whole (all or nothing) and transfers the database from one consistent state to another consistent state.

A transaction has four important properties, known as the ACID properties:

  • (A) Atomicity. A transaction is executed as an atomic operation: either the entire transaction is executed, or none of it is.
  • (C) Consistency. A transaction moves a database from one consistent state to another consistent state. Within a transaction, consistency may be temporarily violated.
  • (I) Isolation. Transactions of different users must not interfere with each other (for example, they behave as if they were executed strictly one after another).
  • (D) Durability. If a transaction has completed, the results of its work must be preserved in the database, even if the system crashes the very next moment.

A transaction usually starts automatically when the user connects to the DBMS and continues until one of the following events occurs (a small SQL sketch follows the list):

  • The COMMIT WORK command was issued.
  • The ROLLBACK WORK command was issued.
  • The user disconnected from the DBMS.
  • A system failure occurred.

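A minimal sketch of an explicit transaction, assuming a hypothetical accounts table with columns id and balance (the exact BEGIN/COMMIT/ROLLBACK syntax varies slightly between DBMSs):

    -- Transfer 100 from account 1 to account 2 as one unit of work
    BEGIN TRANSACTION;

    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;

    -- Make both changes permanent (atomicity and durability)
    COMMIT WORK;

    -- Had an error been detected instead, both updates could be undone:
    -- ROLLBACK WORK;
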
To the user, a transaction usually appears atomic. In reality, it is a complex mechanism of interaction between the user (application) and the database. Enterprise software relies on on-line transaction processing (OLTP) systems; in particular, accounting programs, software for receiving and processing client orders, and financial applications produce a great deal of information. These systems are designed (and optimized accordingly) to process large amounts of data, complex transactions, and intensive read/write operations.

Unfortunately, the information stored in the databases of OLTP systems is not very suitable for use by ordinary users (because of the high degree of table normalization, the specific data presentation formats, and other factors). Therefore, data from the various information "pipelines" is sent (that is, copied) to a warehouse for storage, sorting, and subsequent delivery to the consumer. In information technology, the role of such warehouses is played by data warehouses (information storages).

Delivery of information to the end user is handled by on-line analytical processing (OLAP) systems, which provide extremely easy access to data through convenient tools for building queries and analyzing the results. In OLAP systems, the value of the information product is increased through the use of various methods of analysis and statistical processing. In addition, these systems are optimized for fast data extraction and the collection of aggregated information, and they are aimed at ordinary users (they have an intuitive interface). If an OLTP system answers simple questions such as "what was the level of sales of product N in region M in January 199x?", OLAP systems are ready for more complex user requests, for example: "Provide an analysis of sales of product N in all regions according to the plan for the second quarter, in comparison with the two previous years."
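
As a rough illustration of the difference, here are an OLTP-style point query and an OLAP-style aggregate query; the sales table, its columns, and the identifiers 'N' and 'M' are purely hypothetical:

    -- OLTP-style question: sales of product N in region M in January 1998
    SELECT SUM(amount)
    FROM sales
    WHERE product_id = 'N'
      AND region_id = 'M'
      AND sale_date BETWEEN DATE '1998-01-01' AND DATE '1998-01-31';

    -- OLAP-style question: sales of product N by region and year,
    -- so that the current year can be compared with the two previous ones
    SELECT region_id,
           EXTRACT(YEAR FROM sale_date) AS sales_year,
           SUM(amount) AS total_sales
    FROM sales
    WHERE product_id = 'N'
      AND EXTRACT(YEAR FROM sale_date) BETWEEN 1996 AND 1998
    GROUP BY region_id, EXTRACT(YEAR FROM sale_date)
    ORDER BY region_id, sales_year;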

Client / server architecture

In modern distributed information processing systems, client/server technology takes center stage. In a client-server architecture, data processing is divided between the client computer and the server computer, which communicate over a network. This separation of data processing is based on the grouping of functions. Typically, a database server computer is dedicated to performing database operations, while the client computer runs the application programs. Figure 2.1 shows a simple client-server system consisting of a computer acting as the server and another computer acting as its client. Each machine performs different functions and has its own resources.

Fig. 2.1. A client-server architecture system: a database server computer connected over the network to several client computers (IBM-compatible PCs) running the applications.

The main function of the client computer is to execute the application (user interface and presentation logic) and communicate with the server when required by the application.

A server is an object (computer) that provides services to other objects at their request.

As follows from the term itself, the main function of the server computer is to serve the needs of the client. The term "Server" is used to denote two different groups of functions: a file server and a database server (hereinafter, these terms mean, depending on the context, either software that implements the specified groups of functions, or computers with this software). File servers are not designed to perform database operations, their main function is to share files between multiple users, i.e. providing simultaneous access of many users to files on the computer - file server. An example of a file server is Novell's NetWare operating system. The database server can be installed and operated on a file server computer. Oracle DBMS in the form of NLM (Network Loadable Module) is executed in the NetWare environment on the file server.

The local network server must have the resources appropriate to its functional purpose and the needs of the network. Note that in connection with the focus on the open systems approach, it is more correct to talk about logical servers (meaning a set of resources and software that provide services over these resources), which are not necessarily located on different computers. A feature of a logical server in an open system is that if, for reasons of efficiency, it is advisable to move the server to a separate computer, then this can be done without the need for any modification, both of itself and of the applications that use it.

One of the important server requirements is that the operating system hosting the database server must be multitasking (and preferably, but not necessarily multiuser). For example, an Oracle DBMS installed on a personal computer with an MS-DOS (or PC-DOS) operating system that does not meet the multitasking requirement cannot be used as a database server. And the same Oracle database installed on a computer with a multitasking (though not multiuser) OS / 2 operating system can be a database server. Many flavors of UNIX, MVS, VM and some other operating systems are both multitasking and multi-user.

Distributed computing

The term "distributed computing" is often used to refer to two different, albeit complementary, concepts:

  • Distributed database;
  • Distributed data processing.

The application of these concepts makes it possible to organize access to information stored on multiple machines for end users using different means.

There are many types of servers:

  • Database server;
  • Print server;
  • Remote access server;
  • Fax server;
  • Web server, etc.

Client/server technology rests on such basic technologies as:

  • Operating system technologies, the concept of interaction of open systems, the creation of object-oriented environments for the functioning of programs;
  • Telecommunication technologies;
  • Network technologies;
  • Graphical User Interface Technologies ( GUI);
  • Etc.

Advantages of client-server technology:

  • Client/server technology enables computing in heterogeneous environments. Platform independence: access to heterogeneous network environments that include different types of computers with different operating systems.
  • Independence from data sources: access to information from heterogeneous databases. Examples of such systems are DB2, SQL / DS, Oracle, Sybase.
  • Load balance between client and server.
  • Perform computation where it is most efficient;
  • Provide the ability to efficiently scale;
  • Cross-platform computing. Cross-platform computing is simply defined as the implementation of technologies in heterogeneous computing environments. The following capabilities should be provided here:
  • The application must run on multiple platforms;
  • On all platforms, it must have the same interface and logic of work;
  • The application must integrate with the native operating environment;
  • It should behave the same on all platforms;
  • Simple and consistent support should be provided for it.

Distributed computing. Distributed computing involves the distribution of work among several computers (although distributed computing is a broader concept).

Downsizing. Downsizing is the porting of mainframe applications to small computer platforms.

  • Reduced infrastructure and hardware costs. Cost-effective: The availability of low-cost computing equipment and the increasing proliferation of local area networks make client-server technology more cost-effective than other data processing technologies. The equipment can be upgraded as soon as the need arises.

  • Reducing the overall execution time of the application;
  • Reducing client memory usage;
  • Reducing network traffic.

  • Ability to work with multimedia: to date, many multimedia programs have been created for the PC. There are either no such programs for the terminal-host configuration, or they are very expensive.
  • The ability to attract large computing resources for database operations: since applications are executed on client computers, additional (compared to the terminal-host configuration) resources are freed up on the server computer for database operations, such as computing resources of the central processor and operational memory.
  • Better Programmer Productivity: Programmer productivity is increased by using tools such as SQL * Forms and CASE, which allow you to develop applications faster than programming languages ​​such as C, PL1, or COBOL.
  • Increased End User Productivity: By now, many end users have mastered systems such as Lotus, Paradox, Word Perfect, Harvard Graphics, and more.

The server-side interface is defined and fixed. Therefore, it is possible to create new client parts of an existing system (an example of interoperability at the system level).

Fig. 2.2. Illustration of client access to a server share.

How to implement client-server technology

The installation of a system based on client-server technology and capable of performing distributed data processing is discussed below. The following computer hardware and software is required:

  • database server computer;
  • client computers;
  • communication network;
  • network software;
  • application software.

The SQL language. The high-level query language SQL (Structured Query Language) is used to implement queries to databases, serving as a data manipulation and data definition language, and has been adopted as a standard. SQL was originally adopted as the data language of IBM's software products and of IBM's relational DBMS System R. An important feature of SQL is that one and the same language is available through two different interfaces: an interactive interface and an application programming interface (dynamic SQL). Dynamic SQL consists of a set of built-in SQL facilities provided specifically for building interactive applications, where an interactive application is understood as a program written to support access to the database by an end user working at an interactive terminal. SQL provides data definition, data manipulation, and data control functions and, from the user's point of view, is transparent with respect to the DBMS that implements it.
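
A minimal sketch of the three groups of SQL statements just mentioned (data definition, data manipulation, and data control); the goods table and the report_user role are hypothetical:

    -- Data definition
    CREATE TABLE goods (
        goods_id INTEGER      NOT NULL PRIMARY KEY,
        title    VARCHAR(100) NOT NULL,
        price    NUMERIC(10, 2)
    );

    -- Data manipulation
    INSERT INTO goods (goods_id, title, price) VALUES (1, 'Honey', 12.50);
    SELECT title, price FROM goods WHERE price < 100;

    -- Data control (access management)
    GRANT SELECT ON goods TO report_user;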

Fig. 2.3. Scheme of executing user queries to distributed databases.

The internal structure of the databases is determined by the data models used. The conceptual model has more abstraction capabilities and richer semantics than external models. External models are often referred to as syntactic or operational models, referring to the syntactic nature of control and use as a means of user interaction with the database. In Information Modeling, there are different levels of abstraction, from the conceptual model to the physical data model, that affect the architecture of the DBMS.

The data model has three components:

  • The data structure that represents the database from the user's point of view.
  • The valid operations on the data structure. It must be possible to work with this structure using data definition (DDL) and data manipulation (DML) operations. A rich structure is worthless if there is no way to manipulate its contents.
  • Integrity control constraints. The data model should be provided with means to maintain its integrity and protect it. As an example, consider the following two constraints:
  • Each subtree must have a parent (source) node. Hierarchical databases cannot store a child node without a parent node.
  • In a relational database there cannot be identical tuples. For a file, this requirement means that all records must be unique.

One of the most important characteristics of a DBMS is the ability to link objects.

There are the following types of links between objects (a relational sketch follows the list):

  • One-to-one (1:1). One object of one set can be associated with one object of another set.
  • One-to-many (1:M). One object of one set can be associated with many objects of another set.
  • Many-to-many (M:N). One object of one set can be associated with many objects of another set, while one object of the other set can in turn be associated with many objects of the first set.
  • Ramified (branching). One object of one set can be associated with objects of many sets.
  • Recursive. One object of a given set can be associated with another object of the same set.
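
A sketch of how these link types are usually expressed in a relational schema (all table and column names are illustrative; a 1:1 link is simply a foreign key additionally declared UNIQUE):

    -- One-to-many (1:M): each order belongs to exactly one customer
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
    );

    -- Many-to-many (M:N): resolved through an associative (junction) table
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        title      VARCHAR(100) NOT NULL
    );

    CREATE TABLE order_line (
        order_id   INTEGER REFERENCES customer_order (order_id),
        product_id INTEGER REFERENCES product (product_id),
        PRIMARY KEY (order_id, product_id)
    );

    -- Recursive: a row may reference another row of the same table
    CREATE TABLE staff (
        staff_id   INTEGER PRIMARY KEY,
        manager_id INTEGER REFERENCES staff (staff_id)
    );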

The following basic data models exist:

  • Relational data model.
  • Hierarchical data model.
  • Incomplete network data model.
  • CODASYL data model.
  • Extended network data model.

V.3. INTERNET/INTRANET TECHNOLOGIES AND CORPORATE DATABASE ACCESS SOLUTIONS

The main problem of systems based on the client-server architecture is that, in accordance with the concept of open systems, they are required to be mobile in the widest possible class of hardware and software solutions of open systems. Even if we restrict ourselves to UNIX-based local area networks, different networks use different equipment and communication protocols. Attempts to create systems that support all possible protocols lead to their overload with network details to the detriment of functionality.

An even more complex aspect of this problem is associated with the possibility of using different representations of data in different nodes of a heterogeneous local network. Different computers may have different addressing, number representation, character encoding, etc. This is especially important for high-level servers: telecommunications, computing, databases.

A common solution to the problem of mobility in systems based on a client-server architecture is to rely on software packages that implement Remote Procedure Call (RPC) protocols. With these tools, a call to a service at a remote site looks like a normal procedure call. RPC tools, which naturally contain all the information about the specifics of the local network hardware and network protocols, translate the call into a sequence of network interactions. Thus, the specifics of the network environment and protocols are hidden from the application programmer.

When a remote procedure is called, RPC programs convert client data formats to intermediate machine-independent formats and then convert to server data formats. When passing the response parameters, similar transformations are performed.

Industry data models

The main purpose of models is to facilitate orientation in the data space and help in highlighting the details that are important to business development. In today's environment, for a successful business, it is imperative to have a clear understanding of the links between the various components and to have a good idea of ​​the overall picture of the organization. Identification of all details and relationships using models allows the most efficient use of the time and tools for organizing the work of the company.

Data models are abstract models that describe how data is presented and accessed. Data models define data items and the relationships between them in a particular area. A data model is a navigation tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real-world information. This allows for better communication within the organization and thus creates a more flexible and stable application environment.

The data model uniquely defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as, for example, an image, binary file, or text, where the meaning can be ambiguous).

As a rule, models of a higher level (more general in content) and of a lower level (correspondingly more detailed) are distinguished. The upper level of modeling consists of the so-called conceptual data models, which give the most general picture of the functioning of an enterprise or organization. A conceptual model includes the main concepts or subject areas that are critical to the functioning of the organization; usually their number does not exceed 12-15. Such a model describes the classes of entities that are important to the organization (business objects), their characteristics (attributes), and the associations between pairs of these classes (that is, relationships). Since the terminology of business modeling has not yet fully settled, in various English-language sources conceptual data models may also be called domain models or subject enterprise data models.

The next hierarchical level is occupied by logical data models. They can also be called enterprise data models or business models. These models contain data structures, their attributes and business rules, and represent the information used by an enterprise from a business perspective. In such a model, data is organized in the form of entities and relationships between them. The logical model presents data in a way that makes it easy for business users to understand. In the logical model a data dictionary can be distinguished - a list of all entities with their precise definitions, which allows different categories of users to have a common understanding of all input and output information streams of the model. The next, lower level of modeling is the physical implementation of the logical model using specific software and technical platforms.

The logical model contains a detailed corporate business solution, which usually takes the form of a normalized model. Normalization is a process that ensures that each data item in the model has only one value and is completely and uniquely dependent on the primary key. Data items are organized into groups according to their unique identification. The business rules governing the data items must be fully incorporated into the normalized model, with prior verification and validation. For example, a data item such as Customer Name is likely to be split into First Name and Last Name and grouped with other related data items into a Customer entity with the primary key Customer ID.

The logical data model is independent of application technologies such as databases, networking technologies, or reporting tools, and the means of their physical implementation. There can be only one Enterprise Data Model in an organization. Logical models typically include thousands of entities, relationships, and attributes. For example, a data model for a financial institution or telecommunications company can contain about 3000 industry concepts.

It is important to distinguish between logical and semantic data model. The logical data model represents an enterprise business solution, and the semantic data model represents an applied business solution. The same corporate logical data model can be implemented using different semantic models, i.e. semantic models can be viewed as the next level of modeling approaching physical models. Moreover, each of these models will represent a separate "slice" of the corporate data model in accordance with the requirements of various applications. For example, in the corporate logical data model, the Client entity will be completely normalized, and in the semantic model for the data mart, it can be represented as a multidimensional structure.

A company has two ways to create a corporate logical data model: build it independently or use a ready-made industry model (industry logical data model). In this case, the difference in terms reflects only different approaches to building the same logical model. If a company independently develops and implements its own logical data model, such a model is usually called simply a corporate logical model. If an organization decides to use a ready-made product from a professional supplier, then we can speak of an industry logical data model. The latter is a ready-made logical data model that reflects the functioning of a particular industry with a high degree of accuracy. An industry logical model is a domain-specific, integrated view of all the information that must reside in an enterprise data warehouse to answer both strategic and tactical business questions. Like any logical data model, the industry model is independent of application decisions. It also does not include derived data or other calculations intended to speed up data retrieval. As a rule, most of the logical structures of such a model translate well into an effective physical implementation. Such models are developed by many suppliers for a wide variety of areas of activity: finance, manufacturing, tourism, healthcare, insurance, etc.

An industry logical data model contains information that is common to the industry and therefore cannot be a comprehensive solution for a company. Most companies have to grow the model by an average of 25% by adding data items and expanding definitions. Out-of-the-box models contain only key data elements, and the rest of the elements must be added to the corresponding business objects during the installation of the model in the company.

Industry logical data models contain a significant amount of abstraction. Abstractions mean the union of similar concepts under common names such as Event or Participant. This adds flexibility and uniformity to industry models. Thus, the concept of an Event is applicable to all industries.

Business Intelligence Specialist Steve Hoberman identifies five factors to consider when deciding whether to acquire an industry data model. The first is the time and money needed to build the model. If an organization needs to achieve results quickly, then the industry model will be beneficial. Using an industry model may not immediately provide a picture of the entire organization, but it can save a significant amount of time. Instead of modeling itself, time will be spent linking existing structures to the industry model and discussing how best to customize it to the needs of the organization (for example, which definitions should be changed and which data items should be added).

The second factor is the time and money required to keep the model in good working order. If the enterprise data model is not part of a methodology that allows you to monitor compliance with its accuracy and compliance with modern standards, then such a model becomes outdated very quickly. The industry data model can prevent this risk from happening as it is kept up to date with external resources. Of course, changes taking place within the organization should be reflected in the model by the company itself, but industry changes will be reproduced in the model by its supplier.

The third factor is experience in risk assessment and modeling. The creation of a corporate data model requires qualified resources from both the business and the IT staff. As a rule, managers are well aware of either the work of the organization as a whole, or the activities of a particular department. Few of them have both broad (company-wide) and deep (within departments) knowledge of their business. Most managers usually know only one area well. Therefore, in order to get the general corporate picture, significant business resources are required. This also increases the demands on the IT staff. The more business resources are required to create and test a model, the more experienced analysts must be. They must not only know how to get information from the business staff, but also be able to find a common point of view in contentious areas and be able to present all this information in an integrated way. The person creating the model (in many cases the same analyst) must have good modeling skills. Building enterprise logic models requires modeling “for the future” and the ability to literally convert complex business “into squares and lines”.

On the other hand, the industry model allows outside expertise to be leveraged. Industry-specific logic models are built using proven modeling methodologies and teams of experienced professionals to avoid common and costly problems that can arise when developing enterprise data models within an organization.

The fourth factor is the existing application infrastructure and supplier relationships. If an organization already uses many tools from the same supplier and has an established relationship with that supplier, then it makes sense to order the industry model from the same supplier as well. Such a model will be able to work freely with other products from the same supplier.

The fifth factor is intra-industry exchange of information. If a company needs to communicate with other organizations working in the same field, then the industry model can be very useful in this situation. Organizations within the same industry use similar structural components and terminology. Nowadays, in most industries, companies are forced to exchange data in order to successfully conduct business.

The most effective are the industry models offered by professional suppliers. High efficiency of their use is achieved due to the significant level of detail and accuracy of these models. They usually contain many data attributes. In addition, the creators of these models not only have extensive modeling experience, but are also well versed in building models for a particular industry.

Industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of organizations surveyed found integration to be a significant barrier to new applications adoption. On the contrary, the implementation of data integration generates tangible income for the company.

The industry data model, in addition to linking to existing systems, provides great benefits for enterprise-wide projects such as Enterprise Resource Planning (ERP), master data management, business intelligence, data quality improvement, and employee development.

Thus, industry logical data models are an effective tool for integrating data and obtaining a holistic view of the business. The use of logical models seems to be a necessary step towards the creation of corporate data warehouses.

Publications

  1. Steve Hoberman. Leveraging the Industry Logical Data Model as Your Enterprise Data Model.
  2. Claudia Imhoff. Fast-Tracking Data Warehousing & Business Intelligence Projects via Intelligent Data Modeling

Zaitsev S.L., Ph.D.

Repeating groups

Repeating groups are attributes for which a single instance of an entity can have more than one value. For example, a person can have more than one skill. If, in terms of business requirements, we need to know the skill level for each skill, and each person can have only two skills, we can create the entity shown in Fig. 1.6: the entity PERSON with two attributes for storing skills and the skill level for each.

Fig. 1.6. This example uses repeating groups.

The problem with repeating groups is that we cannot know exactly how many skills a person might have. In real life, some people have one skill, some have several, and some have none yet. Figure 1.7 shows the model reduced to first normal form. Note the added Skill ID attribute, which uniquely identifies each SKILL.

Fig. 1.7. Model reduced to first normal form.
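
A possible relational reading of the 1NF model in Fig. 1.7 (a sketch only; the column names are illustrative). The repeating group becomes one row per person/skill combination, but the skill attributes still live with that combination:

    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        full_name VARCHAR(100) NOT NULL
    );

    -- 1NF: one row per (person, skill); Skill Name is still stored here
    CREATE TABLE person_skill (
        person_id   INTEGER NOT NULL REFERENCES person (person_id),
        skill_id    INTEGER NOT NULL,
        skill_name  VARCHAR(100),
        skill_level INTEGER,
        PRIMARY KEY (person_id, skill_id)
    );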

One fact in one place

If the same attribute is present in more than one entity and is not a foreign key, then this attribute is considered redundant. The logical model should not contain redundant data.

Redundancy requires additional space, but while memory efficiency is important, the real problem lies elsewhere. Ensuring that redundant data is synchronized is overhead, and you always run the risk of conflicting values.

In the previous example, SKILL depends on both Person ID and Skill ID. This means that a SKILL cannot exist until a PERSON possessing that skill appears. It also makes it difficult to change the Skill Name: every record containing that Skill Name has to be found and changed for each PERSON who has the skill.

Figure 1.8 shows the model in second normal form. Note the added SKILL entity; the Skill Name attribute has been moved into it. The Skill Level remains, accordingly, at the intersection of PERSON and SKILL.

Fig. 1.8. In second normal form, the repeating group is moved to another entity. This provides the flexibility to add the required number of skills and to change the Skill Name or Skill Description in one place.
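
Continuing the sketch above, in 2NF the Skill Name (which depends only on Skill ID, i.e. on part of the key) moves to its own SKILL table, while Skill Level stays at the intersection:

    CREATE TABLE skill (
        skill_id   INTEGER PRIMARY KEY,
        skill_name VARCHAR(100) NOT NULL
    );

    -- Replaces the 1NF version of person_skill shown earlier
    CREATE TABLE person_skill (
        person_id   INTEGER NOT NULL REFERENCES person (person_id),
        skill_id    INTEGER NOT NULL REFERENCES skill (skill_id),
        skill_level INTEGER,
        PRIMARY KEY (person_id, skill_id)
    );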

Each attribute depends on the key

Each attribute of an entity must depend on the primary key of that entity. In the previous example, School Name and Geographic Area are present in the PERSON table but do not describe the person. To achieve third normal form, these attributes must be moved to an entity where they will depend on the key. Figure 1.9 shows the model in third normal form.

Fig. 1.9. In third normal form, School Name and Geographic Region are moved to an entity where their values depend on the key.

Many-to-many relationships

Many-to-many relationships reflect the reality of the surrounding world. Note that in Figure 1.9 there is a many-to-many relationship between PERSON and SCHOOL. The relationship accurately reflects the fact that a PERSON can study at many SCHOOLs and a SCHOOL can teach many PERSONs. To achieve fourth normal form, an associative entity is created that eliminates the many-to-many relationship by generating a separate entry for each unique combination of school and person. Figure 1.10 shows the model in fourth normal form.

Fig. 1.10. In fourth normal form, the many-to-many relationship between PERSON and SCHOOL is resolved by introducing an associative entity, in which a separate entry is allocated for each unique combination of SCHOOL and PERSON.
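
The associative entity of Fig. 1.10 corresponds to a junction table with a composite primary key (a sketch only; the person table is the one from the earlier sketches):

    CREATE TABLE school (
        school_id   INTEGER PRIMARY KEY,
        school_name VARCHAR(100) NOT NULL
    );

    -- One row for each unique combination of PERSON and SCHOOL
    CREATE TABLE person_school (
        person_id INTEGER NOT NULL REFERENCES person (person_id),
        school_id INTEGER NOT NULL REFERENCES school (school_id),
        PRIMARY KEY (person_id, school_id)
    );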

Formal definitions of normal forms

The following definitions of normal forms can seem daunting. Think of them simply as formulas for achieving normalization. Normal forms are based on relational algebra and can be interpreted as mathematical transformations. Although this book is not devoted to a detailed discussion of normal forms, modelers are encouraged to take a deeper look at the subject.

In a given relation R, attribute Y is functionally dependent on attribute X - in symbolic form, R.X -> R.Y (read "R.X functionally determines R.Y") - if and only if each value of X in R is associated with exactly one value of Y in R (at any given time). Attributes X and Y may be composite (Date C.J. An Introduction to Database Systems. 6th edition. Williams, 1999, 848 pp.).

The relation R corresponds to the first normal form (1NF) if and only if all domains belonging to it contain only atomic values ​​(Date, ibid.).

A relation R corresponds to second normal form (2NF) if and only if it corresponds to 1NF, and each non-key attribute is completely dependent on the primary key (Date, ibid.).

A relation R corresponds to third normal form (3NF) if and only if it corresponds to 2NF, and each non-key attribute does not transitively depend on the primary key (Date, ibid.).

The relation R corresponds to Boyce-Codd normal form (BCNF) if and only if every determinant is a candidate key.

NOTE Below is a brief explanation of some of the abbreviations used in Date's definitions.

MVD (multi-valued dependency) is a multi-valued dependency. Used only for entities with three or more attributes. In a multivalued dependency, the value of the attribute depends only on a portion of the primary key.

FD (functional dependency) - functional dependency. With functional dependency, the value of an attribute depends on the value of another attribute that is not part of the primary key.

JD (join dependency) is a join dependency. With a union dependency, the primary key of the parent entity is traced back to at least the third level descendants, while retaining the ability to be used in the union by the original key.

A relation R corresponds to fourth normal form (4NF) if and only if, whenever there is an MVD in R, say A ->> B, all attributes of R are functionally dependent on A. In other words, R contains only dependencies (FD or MVD) of the form K -> X (i.e., a functional dependency of attribute X on a candidate key K). Accordingly, R meets the requirements of 4NF if it complies with BCNF and all MVDs are in fact FDs (Date, ibid.).

For fifth normal form: a relation R satisfies the join dependency (JD) *(X, Y, ..., Z) if and only if R is equivalent to the join of its projections onto X, Y, ..., Z, where X, Y, ..., Z are subsets of the set of attributes of R.

There are many other normal forms for complex data types and specific situations that are beyond the scope of this discussion. Any model development enthusiast would like to learn other normal forms as well.

Business normal forms

In his book (An Introduction to Information Engineering: From Strategic Planning to Information Systems. Reading, Massachusetts: Addison-Wesley, 1989), Clive Finkelstein took a different approach to normalization. He defines business normal forms in terms of the transformations needed to bring a model to those forms. Many modelers find this approach more intuitive and more pragmatic.

The first business normal form (1BNF) takes out repeating groups to another entity. This entity gets its own name and primary (composite) key attributes from the original entity and its repeating group.

The second business normal form (2BNF) takes out attributes that are partially dependent on the primary key to another entity. The primary (composite) key of this entity is the primary key of the entity in which it was originally located, along with additional keys on which the attribute depends entirely.

The third business normal form (3BNF) takes attributes that are independent of a primary key into another entity, where they are completely dependent on the primary key of that entity.

The fourth business normal form (4BNF) takes attributes that depend on the value of the primary key or are optional to a secondary entity, where they depend entirely on the value of the primary key, or where they must (necessarily) be present in that entity.

The fifth business normal form (5BNF) appears as a structural entity if there is a recursive or other dependency between instances of a secondary entity, or if a recursive dependency exists between instances of its primary entity.

Completed Logical Data Model

The completed logical model must satisfy the requirements of the third business normal form and include all the entities, attributes, and relationships necessary to support the data requirements and business rules associated with the data.

All entities must have names that describe their content and have a clear, concise, complete description or definition. A future post will cover an initial set of guidelines for the correct formation of entity names and descriptions.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its meaning, a logical data type, and a clear, short, complete description or definition. In a future post, we will look at an initial set of guidelines for properly forming attribute names and descriptions.

Relationships should include a verb construction that describes the relationship between entities, along with characteristics such as plurality, necessity of existence or possibility of absence of a relationship.

NOTE The plurality (cardinality) of a relationship describes the maximum number of secondary entity instances that can be associated with an instance of the original entity. The necessity of existence or the possibility of absence of a relationship is used to determine the minimum number of instances of a secondary entity that can be associated with an instance of the original entity.

Physical data model

Once you have created a complete and adequate logical model, you are ready to make the decision to choose an implementation platform. The choice of platform depends on the requirements for the use of data and the strategic principles of shaping the architecture of the corporation. Choosing a platform is a complex issue beyond the scope of this book.

In ERwin, a physical model is a graphical representation of a real-world database. The physical database will be made up of tables, columns, and relationships. The physical model depends on the platform chosen for implementation and the requirements for using the data. The physical model for IMS will be very different from that for Sybase. The physical model for OLAP reports will look different from the model for OLTP (online transaction processing).

The data modeler and database administrator (DBA) use the logical model, usage requirements, and corporate architecture policy to develop a physical data model. You can denormalize the physical model to improve performance, and create views to support usage requirements. The following sections detail the process of denormalizing and creating views.

This section provides an overview of the process of building a physical model, collecting data usage requirements, defining the components of a physical model, and providing reverse engineering. In the following publications, these issues are covered in more detail.

Collecting data usage requirements

You typically collect data usage requirements early on, during interviews and work sessions. At the same time, the requirements should define as fully as possible how the user will use the data. A superficial approach and gaps in the physical model can lead to unplanned costs and delays in project implementation. Requirements for use include:

    Access and performance requirements

    Volumetric characteristics (an estimate of the amount of data to be stored) that allow the administrator to represent the physical volume of the database

    Estimating the number of users who need concurrent access to data to help you design your database for acceptable performance levels

    Aggregates, pivots, and other calculated or derived data that can be considered candidates for storage in persistent data structures

    Requirements for reporting and standard queries to help the database administrator build indexes

    Views (persistent or virtual) that will assist the user when performing data aggregation or filtering operations (a sketch follows this list).
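
A sketch of a view supporting such usage requirements, assuming a hypothetical sales table; whether a view can be made persistent (materialized) depends on the DBMS:

    -- Virtual view that pre-aggregates data for reporting users
    CREATE VIEW sales_by_region_month AS
    SELECT region_id,
           EXTRACT(YEAR FROM sale_date)  AS sales_year,
           EXTRACT(MONTH FROM sale_date) AS sales_month,
           SUM(amount)                   AS total_amount
    FROM sales
    GROUP BY region_id,
             EXTRACT(YEAR FROM sale_date),
             EXTRACT(MONTH FROM sale_date);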

In addition to the chairman, secretary, and users, the modeler, database administrator, and database architect must participate in the usage requirements session. The user's historical data requirements should be discussed. The length of time that data is retained has a significant impact on the size of the database. Often, older data is stored in a generalized form, and atomic data is archived or deleted.

Users should bring examples of queries and reports with them to the session. Reports must be strictly defined and must include the atomic values used for any total or summary fields.

Physical data model components

The components of a physical data model are tables, columns, and relationships. Entities of the logical model are likely to become tables in the physical model. Logical attributes become columns. Logical relationships become referential integrity constraints. Some logical relationships cannot be implemented in a physical database.

Reverse engineering

When a logical model is not available, it becomes necessary to recreate the model from the existing database. In ERwin, this process is called reverse engineering. Reverse engineering can be done in several ways. The modeler can explore the data structures in the database and recreate the tables in a visual modeling environment. Alternatively, data definition language (DDL) scripts can be imported into a tool that supports reverse engineering (such as ERwin). Advanced tools such as ERwin include functions that provide an ODBC connection to an existing database and create a model by directly reading its data structures. Reverse engineering with ERwin will be discussed in detail in a future post.

Using corporate functional boundaries

When building a logical model for a modeler, it is important to ensure that the new model is consistent with the corporate model. Using corporate functional boundaries means modeling data in terms used within a corporation. The way data is used in a corporation is changing faster than the data itself. In each logical model, data must be presented in a holistic manner, regardless of the business domain it supports. Entities, attributes, and relationships must define business rules at the corporation level.

NOTE Some of my colleagues refer to the use of these corporate functional boundaries as real-world modeling. Real-world modeling encourages the modeler to view information in terms of its inherent relationships and associations.

The use of corporate functional boundaries for a data model that is constructed appropriately provides the basis for supporting the information needs of any number of processes and applications, which enables the corporation to more efficiently exploit one of its most valuable assets - information.

What is an Enterprise Data Model?

Enterprise data model (EDM) contains entities, attributes, and relationships that represent the information needs of a corporation. EDM is usually categorized according to subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions such as contract management, while others may include entities that describe products or services.

Each logical model must correspond to the existing domain of the corporate data model. If the logical model does not meet this requirement, a domain model must be added to it. This comparison ensures that the corporate model is improved or adjusted and that all logical modeling efforts are coordinated within the corporation.

EDM also includes specific entities that define the scope of values ​​for key attributes. These entities have no parents and are defined as independent. Independent entities are often used to maintain the integrity of relationships. These entities are identified by several different names such as code tables, reference tables, type tables, or classification tables. We will use the term "corporate business object". An enterprise business object is an entity that contains a set of attribute values ​​that are independent of any other entity. Corporate business objects should be used consistently within a corporation.

Building a corporate data model by augmenting

There are organizations where the corporate model has been built from start to finish as a result of a single concerted effort. On the other hand, most organizations build fairly complete corporate models by scaling up.

Building up means building something sequentially, layer by layer, just like an oyster grows a pearl. Each data model created provides a contribution to the formation of the EDM. Building an EDM in this way requires additional modeling steps to add new data structures and domains or augment existing data structures. This makes it possible to build an enterprise data model by augmenting, iteratively adding levels of detail and refinement.

Modeling methodology concept

There are several visual data modeling methodologies. ERwin supports two:

    IDEF1X (Integration Definition for Information Modeling - an integrated description of information models).

    IE (Information Engineering).

IDEF1X is a good methodology, and the use of its notation is widespread.

Integrated description of information models

IDEF1X is a highly structured data modeling methodology that extends the IDEF1 methodology adopted as a FIPS (Federal Information Processing Standards) standard. IDEF1X uses a highly structured set of modeling construct types and results in a data model that requires an understanding of the physical nature of the data before such information can be made available.

The rigid structure of IDEF1X forces the modeler to assign characteristics to entities that may not correspond to the realities of the surrounding world. For example, IDEF1X requires all entity subtypes to be exclusive. This means that a person cannot be both a client and an employee, although real practice tells us otherwise.

Information engineering

Clive Finkelstein is often referred to as the father of information engineering, although similar concepts were also developed by James Martin (Martin, James. Managing the Database Environment. Upper Saddle River, New Jersey: Prentice Hall, 1983). Information Engineering uses a business-driven approach to information management and uses a different notation to represent business rules. IE serves as an extension and development of the notation and core concepts of the ER methodology proposed by Peter Chen.

IE provides the infrastructure to support information requirements by integrating corporate strategic planning with information systems that are being developed. This integration allows the management of information resources to be more closely aligned with the long-term strategic prospects of the corporation. This business-driven approach has led many modelers to choose IE over other methodologies that tend to focus on short-term development challenges.

IE proposes a sequence of actions that leads a corporation to identify all of its information needs for collecting and managing data and identifying relationships between information objects. As a result, information requirements are clearly articulated based on management directives and can be directly translated into a management information system that will support strategic information needs.

Conclusion

Understanding how to use a data modeling tool like ERwin is only part of the problem. In addition, you must understand when data modeling tasks are being solved and how the information requirements and business rules that should be represented in the data model are collected. Conducting work sessions provides the most conducive environment for collecting information requirements in an environment that includes domain experts, users, and information technology professionals.

Building a good data model requires analyzing and researching the information requirements and business rules collected through work sessions and interviews. The resulting data model should be compared with the enterprise model, if possible, to ensure that it does not conflict with existing object models and includes all the required objects.

The data model consists of logical and physical models that represent the information requirements and business rules. The logical model should be reduced to third normal form. Third normal form limits insertion, update, and deletion anomalies in the data structures in order to support the "one fact in one place" principle. The collected information requirements and business rules should be analyzed and researched and compared with the enterprise model to ensure that they do not conflict with existing object models and that they include all the required objects.

In ERwin, the data model includes both logical and physical models. ERwin implements the ER approach and allows you to create logical and physical model objects to represent information requirements and business rules. Logical model objects include entities, attributes, and relationships. Physical model objects include tables, columns, and constraints on the integrity of relationships.

One of the following publications will cover the issues of identifying entities, defining entity types, choosing entity names and descriptions, as well as some techniques to avoid the most common modeling errors associated with the use of entities.



  • 1. Relational data model
    • 1.1 The relational data model. Basic definitions
    • 1.2 Operations on relations
  • 2. Corporate information systems
  • Bibliography

1. Relational data model

1.1 The relational data model. Basic definitions

In mathematical disciplines, the concept of a "table" corresponds to the concept of a relation. The table reflects an object of the real world - an entity - and each of its rows reflects a specific instance of that entity. Each column has a name that is unique within the table. Rows have no names, their order is not defined, and their number is logically unlimited. One of the main advantages of the relational data model is its homogeneity (each row of a table has the same format). It is up to the user to decide whether the respective entities are homogeneous; this addresses the question of the model's suitability.

Basic concepts (a small SQL sketch follows the list):

* A relation is a two-dimensional table containing some data.

* Entity - an object of any nature, data about which are stored in the database. Attributes are the properties that characterize an entity (its columns).

* The degree of a relation is the number of its columns.

* Relation schema - the list of attribute names, for example, EMPLOYEE (No., full name, year of birth, position, department).

* Domain - the set of admissible values of an attribute of a relation (its data type).

* A tuple is a row of the table.

* Cardinality - the number of rows in the table.

* A primary key is an attribute that uniquely identifies the rows of a relation. A primary key made up of several attributes is called a composite primary key. The primary key cannot be fully or partially empty (null). Keys that could be used as primary keys are called candidate or alternative keys.

* A foreign key is an attribute (or set of attributes) of one table whose values match the primary key values of another table; it references the primary key of that other table.
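
A small SQL sketch tying these concepts together (the EMPLOYEE relation schema from the list above; the column names and the department table are illustrative):

    CREATE TABLE department (
        dept_no   INTEGER PRIMARY KEY,        -- primary key
        dept_name VARCHAR(100) NOT NULL
    );

    -- Relation EMPLOYEE of degree 5; every row is a tuple
    CREATE TABLE employee (
        emp_no     INTEGER PRIMARY KEY,                      -- primary key
        full_name  VARCHAR(100) NOT NULL,                    -- domain: character strings
        birth_year INTEGER,                                  -- domain: integers
        job_title  VARCHAR(50),
        dept_no    INTEGER REFERENCES department (dept_no)   -- foreign key
    );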

Normalization is a process aimed at reducing the redundancy of information in a database. In addition to the data itself, various names, object names and expressions can also be normalized in the database.

A non-normalized database contains information in one or more different tables; this gives the impression that the inclusion of data in a particular table is not due to any apparent reasons. This state of affairs can have a negative impact on data security, the rational use of disk space, the speed of queries, the efficiency of updating the database and, perhaps most importantly, the integrity of the stored information. The database before normalization is a structure that has not logically been broken down into more manageable, smaller tables yet.

The normal form is a kind of indicator of the level, or depth, of database normalization. The normalization level of the database corresponds to the normal form in which it is located.

1.2 Operations on relations

To bring the table to the first normal form (1NF), two rules must be observed:

1. Atomicity or indivisibility. Each column must contain one indivisible value.

2. The table must not contain repeating columns or groups of data.

For example, if a table stores a person's full address (street, city, postal code) in a single field, it does not meet the 1NF rules, because one column holds several different values, which violates the atomicity rule. Likewise, if a movie database contains the columns Actor1, Actor2, Actor3, it also violates the rules, because the same kind of data is repeated across columns.

Normalization should begin by checking the database structure for compliance with 1NF. All columns that are not atomic must be split into their constituent columns. If a table contains repeating columns, they should be moved into a separate table.

To bring the table to first normal form, you should:

* Find all fields that contain multipart pieces of information.

* The data that can be broken down into component parts must be placed in separate fields.

* Move duplicate data into a separate table.

* Check if all tables match the conditions of the first normal form.
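A sketch of these steps applied to the examples above (all table and column names are invented for illustration): the non-atomic address field is split into its parts, and the repeating Actor1, Actor2, Actor3 columns are moved into a separate table with one row per actor.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Before 1NF (comment only):
    --   person(person_id, full_address)                -- non-atomic column
    --   movie(movie_id, title, actor1, actor2, actor3) -- repeating columns

    CREATE TABLE person (
        person_id   INTEGER PRIMARY KEY,
        street      TEXT,              -- the address split into component parts
        city        TEXT,
        postal_code TEXT
    );

    CREATE TABLE movie (
        movie_id INTEGER PRIMARY KEY,
        title    TEXT NOT NULL
    );

    CREATE TABLE movie_actor (         -- one row per actor instead of actor1..3
        movie_id INTEGER REFERENCES movie(movie_id),
        actor    TEXT NOT NULL
    );
""")
conn.close()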

To bring tables to second normal form (2NF), they must already be in 1NF; normalization must proceed in this order.

In second normal form the following condition must be met: every column that is not part of a key (including foreign keys) must depend on the entire primary key, not on just part of it. Columns whose values do not depend on the whole key are usually easy to identify. If the data in a column is not related to the key that describes the row, it should be moved into its own separate table, whose primary key is then placed back into the old table as a foreign key.

To bring the base to the second normal form, you need:

* Identify all columns that are not directly dependent on the primary key of this table.

* Move those columns into the appropriate new tables, selecting primary keys from the existing fields or creating new ones.

* Each table needs its own primary key.

* Create foreign keys and define the relationships between the tables. The final step of normalization to 2NF is assigning the foreign keys that link the associated tables: the primary key of one table must appear as a foreign key in another.

Hints:

Another way to bring a schema to 2NF is to look at the relationships between the tables. Ideally, all of them should be one-to-many relationships; many-to-many relationships need to be restructured (typically by introducing an intermediate junction table).

A properly normalized table will never have duplicate rows, that is, two or more rows whose non-key values contain the same data.
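A sketch of a 2NF decomposition under invented names: in an order_item table with the composite key (order_id, product_id), the columns name and price would depend only on product_id, that is, on part of the key, so they are moved into their own product table and linked back through a foreign key.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Violates 2NF (comment only):
    --   order_item(order_id, product_id, quantity, product_name, price)
    --   product_name and price depend only on product_id, part of the key.

    CREATE TABLE product (                 -- columns that depended on part of
        product_id INTEGER PRIMARY KEY,    -- the key get their own table
        name       TEXT NOT NULL,
        price      REAL NOT NULL
    );

    CREATE TABLE order_item (
        order_id   INTEGER,
        product_id INTEGER REFERENCES product(product_id),  -- link back
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)                  -- composite key
    );
""")
conn.close()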

A database is in third normal form (3NF) if it is in second normal form and every non-key column is independent of every other non-key column, that is, non-key columns depend only on the primary key. If the normalization process has been followed correctly up to this point, converting to 3NF may raise no questions at all. Be aware that 3NF is violated whenever changing the value in one non-key column would require changing the value in another.

To bring the base to the third normal form, you need:

* Determine which fields of which tables have interdependencies, i.e. fields that depend more on each other than on the row as a whole.

* Create the corresponding tables. For each problematic column found in step 1, create a separate table to hold it.

* Create or allocate primary keys. Each table must have a primary key.

* Create the required foreign keys that form any of the relationships.
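A sketch of these steps with invented names: in an employee table the columns dept_name and dept_phone depend on dept_id, another non-key column, rather than on the employee row as a whole, so they are split off into a department table with its own primary key and linked by a foreign key.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Violates 3NF (comment only):
    --   employee(emp_id, full_name, dept_id, dept_name, dept_phone)
    --   dept_name and dept_phone depend on dept_id, not on the whole row.

    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,       -- its own primary key
        name    TEXT NOT NULL,
        phone   TEXT
    );

    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL,
        dept_id   INTEGER REFERENCES department(dept_id)   -- foreign key
    );
""")
conn.close()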

Fourth normal form (4NF) adds one more rule: multivalued dependencies must be excluded. In other words, all rows of a table must be independent of each other: the presence of some row X must not imply that a certain row Y is also present somewhere in the same table.
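A sketch of a multivalued dependency under invented names: if a lecturer's courses and spoken languages were kept in one lecturer_info table, every course row would have to be repeated for every language, so the presence of one row would force the presence of others; splitting the two independent facts into separate tables satisfies 4NF.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Violates 4NF (comment only):
    --   lecturer_info(lecturer, course, language)
    --   courses and languages are independent, so rows multiply combinatorially.

    CREATE TABLE lecturer_course (
        lecturer TEXT,
        course   TEXT,
        PRIMARY KEY (lecturer, course)
    );

    CREATE TABLE lecturer_language (
        lecturer TEXT,
        language TEXT,
        PRIMARY KEY (lecturer, language)
    );
""")
conn.close()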

2. Corporate information systems

A system (from the Greek systema - a whole, a compound made up of parts) is a set of elements that interact with each other, forming a certain integrity, unity. Here are some concepts that are often used to characterize a system.

1. A system element is a part of a system that has a specific functional purpose. Complex elements of systems, in turn, consisting of simpler interconnected elements, are often called subsystems.

2. Organization of the system - internal orderliness, consistency of interaction of system elements, manifested, in particular, in limiting the variety of states of elements within the system.

3. The structure of the system - the composition, order and principles of interaction of the elements of the system, which determine its basic properties. If the elements of the system are distributed across different levels and the internal connections between elements run only between higher and lower levels and back, we speak of a hierarchical structure of the system. Purely hierarchical structures are rare in practice, so, broadening the concept somewhat, a hierarchical structure is usually understood as one in which, among other connections, hierarchical relationships are of primary importance.

4. System architecture - a set of system properties that are essential for the user.

5. Integrity of the system - the fundamental irreducibility of the properties of the system to the sum of the properties of its individual elements (emergence of properties) and, at the same time, the dependence of the properties of each element on its place and function within the system.

An information system is an interconnected set of means, methods and personnel used to store, process and deliver information in order to achieve a stated goal.

The Federal Law "On Information, Informatization and Information Protection" provides the following definition:

"Information system is an organizationally ordered set of documents (arrays of documents) and information technologies, including the use of computer technology and communications that implement information processes"

Classification by scale

In terms of scale, information systems are divided into the following groups:

* single;

* group;

* corporate.

A corporate information system is a scalable system designed for the integrated automation of all types of economic activities of large and medium-sized enterprises, including corporations consisting of a group of companies requiring unified management.

A corporate information system can be considered a system that automates more than 80% of the divisions of an enterprise.

Recently, many publications devoted to the use of information technology in the management of economic objects have used the term "corporate information systems"; in those publications it essentially means the automated information systems of economic objects.

An automated information system (AIS) is a combination of various types of support, together with the specialists who use them, intended to automate the processing of accounting and analytical information. As a rule, the types of support are similar in composition across different systems, which makes it possible to implement the principle of compatibility between systems in the course of their operation. In the process of studying an AIS as a complex system, it is necessary to single out its individual parts and elements and to consider the features of their use at the stages of creation and operation.

Corporate information systems are an evolution of systems for workgroups, they are focused on large companies and can support geographically dispersed nodes or networks. Basically, they have a hierarchical structure of several levels. Such systems are characterized by a client-server architecture with specialization of servers or a multi-tier architecture. When developing such systems, the same database servers can be used as when developing group information systems. However, in large information systems, the most common servers are Oracle, DB2 and Microsoft SQL Server.

For group and corporate systems, the requirements for reliability of operation and data safety are significantly higher. These properties are ensured by maintaining data, referential and transactional integrity in the database servers.
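As a small illustration of transactional integrity (a sketch only, using Python's built-in sqlite3 as a stand-in for the server DBMSs named above; the account table and amounts are invented): either both updates of a funds transfer are committed, or, if an error occurs, both are rolled back and the data stays consistent.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("UPDATE account SET balance = balance - 70 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 70 WHERE id = 2")
except sqlite3.Error:
    pass        # after a rollback both balances remain unchanged

print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
conn.close()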

Classification by scope

According to the scope of application, information systems are usually divided into four groups:

* transaction processing systems;

* decision support systems;

* information and reference systems;

* office information systems.

Bibliography

1. Agaltsov, V.P. Databases. In 2 volumes. Vol. 2. Distributed and remote databases: Textbook / V.P. Agaltsov. - M.: ID FORUM, NITs INFRA-M, 2013.

2. Golitsyna, O.L. Databases: Textbook / O.L. Golitsyna, N.V. Maksimov, I.I. Popov. - M.: Forum, 2012.

3. Karpova, I.P. Databases: Textbook / I.P. Karpova. - SPb.: Peter, 2013.

4. Kirillov, V.V. Introduction to relational databases / V.V. Kirillov, G.Yu. Gromov. - SPb.: BHV-Petersburg, 2012.

5. Pirogov, V.Yu. Information systems and databases: organization and design: Textbook / V.Yu. Pirogov. - SPb.: BHV-Petersburg, 2009.

6. Fedorov, G.N. Information Systems / G.N. Fedorov. - M.: Academy, 2013.

7. Satunina, A.E. Project management of the corporate information system of the enterprise / A.E. Satunina, L.A. Sysoeva. - M.: Finance and Statistics, Infra-M, 2009.
