Peer-to-Peer Issues

Here is a real-world discussion from November 2007 concerning network performance issues and related problems.

Situation

One of my clients is experiencing severe network problems. Here are the facts:

  The network is peer-to-peer, running XP-PRO and XP-HOME on eight PCs.

  Their version of A-Shell is 4.8(480).

  ashell.log shows error #183 and "connection to network lost".

  The only network application they run is Tentmaker.

  When we converted from the Alpha to the PC, we left the DSK structure exactly as it was on the Alpha, so they have DSK0 through DSK20 on their system. Data files are on DSK1-DSK7, the RUN code is on DSK8-DSK14, and temp files are on DSK15-DSK20.

This is the only site I have with these problems. The system just "locks up". To restart it, the sysop has to ZAP the phantom jobs; they can't erase QFLOCK.SYS.

Comments

These are tough problems to solve. I would estimate that over the last 10-15 years, we have seen a dozen extreme cases of this kind, where network flakiness made it difficult to use the system. I'm not sure that any single factor was responsible for more than a couple of them, but the list of causes, in order of popularity, would probably be:

  Virus-checking software out of control. (In one extreme example, one of the PCs had been configured to run a virus check on any new media appearing on the network, so that when another PC put a diskette in, the network ground to a halt, even though the user at that PC had no idea.) Any individual PC could also be configured to run real-time virus checking across the network to the server, causing enough of a slowdown to produce "connection to network lost" errors as a result of timeouts.

  Conflicting software usage. The business application (A-Shell, Tentmaker) is invariably blamed for the problem (the common argument being that nothing else is experiencing it), but on a network of more than five PCs, it is not uncommon to find one of them involved in some software or file-sharing activity that is not really authorized (downloading or storing music, games, etc.), running a background hog (Exchange Server), or hosting spyware that generates extreme CPU and network activity. In the first case, the culprit is not likely to complain; in the other two, the software doesn't complain either, even as it wreaks havoc on the rest of the applications sharing the network.

  Bad hardware components: cables, network cards, switches. These are very hard to diagnose. In some cases, a packet analyzer (or even just the netstat utility; see the sketch after this list) might be able to give you a clue that there are a lot of network errors, but it's pretty difficult to pin down the source without the kind of concerted effort that end users have little patience for. Fortunately, most of the components are cheap and require little or no configuration, so you might, for example, simply try swapping out the main hub/switch. (I've had to do that at least twice on my office LAN, and in both cases the symptoms were similar to what you describe, i.e., general flakiness.)

  OS/configuration conflicts. When you have a mixture of operating system versions, the potential for non-optimal, overlapping, or conflicting configurations is great. The ideal environment would use a domain server (W200x Server) that everyone logged into from XP-SP2 clients; the least ideal is one with no real server and a mixture of "Home"-class clients.
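
As a rough first check along those lines, here is a minimal sketch (in Python, which is not part of A-Shell) that pulls the error counters out of netstat -e. It assumes a Windows machine with English-locale output; the counters reset at boot, so the useful signal is whether they climb while the LAN is busy.

    import subprocess

    def interface_errors() -> dict:
        """Pull the Discards/Errors counters out of Windows 'netstat -e'.

        'netstat -e' prints per-interface Ethernet statistics; the lines
        of interest look like "Errors    0    0" (received, sent).
        """
        out = subprocess.run(["netstat", "-e"],
                             capture_output=True, text=True).stdout
        counters = {}
        for line in out.splitlines():
            parts = line.split()
            if parts and parts[0] in ("Discards", "Errors"):
                counters[parts[0]] = tuple(int(p) for p in parts[1:3])
        return counters

    if __name__ == "__main__":
        # Steadily climbing numbers while the LAN is in use point at a
        # failing NIC, cable, or switch port somewhere on the segment.
        print(interface_errors())  # e.g. {'Discards': (0, 0), 'Errors': (0, 0)}

Running it on each PC in turn (or before and after a heavy file operation) at least narrows down which machine's interface is taking the damage.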

Unfortunately, none of the above information is really helpful to you in the sense of offering an easy solution. Assuming it is not practical to have a network technician come in and do a thorough overhaul of the network, if I were in your position I would consider just switching them to the ATS/ATE model (replacing the peer-to-peer connections with telnet connections). It's not a totally trivial switch, particularly the first time, and it will almost certainly require an update, since I don't think ATS/ATE really functioned under 4.8. There are licensing costs as well, but I would be willing to help you set up an eval to see if it actually helped.
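
As a zero-commitment first step toward such an eval, you could confirm that a client PC can even reach a telnet-style service on the prospective server. This sketch uses only Python's standard socket module; the host name "server-pc" and port 23 are placeholders for illustration, not anything mandated by ATS/ATE.

    import socket

    def port_reachable(host: str, port: int = 23, timeout: float = 5.0) -> bool:
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        host = "server-pc"  # placeholder for the machine that would run ATS
        print(f"{host}:23 reachable:", port_reachable(host))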

The ways in which ATS/ATE can help are:

  Allows you to totally eliminate file sharing, at least for the device acting as the server. This can result in very substantial performance gains, since peer-to-peer I/O can be very inefficient when multiple users are sharing a file. (In fact, if you don't disable file sharing for the shared A-Shell devices, you lose almost all of the performance gain.)

  If the problems are associated more with one client than another, that will be more readily apparent, and the problem client is less likely to disrupt the others, since telnet disconnects can be trapped in software and cleaned up automatically, unlike peer-to-peer network failures (see the first sketch after this list).

  If the application processes more file data than screen I/O (which is virtually always the case for business applications), the load on the network is greatly reduced, since with telnet, the only data passing over the network is the screen I/O. An extreme but typical example is a report program that processes 5MB of file data while displaying only a few hundred bytes on the screen (see the second sketch after this list).

  ATS allows you to connect to the server remotely, which is useful for diagnosis and any other offsite access needs. All this requires is a typical inexpensive broadband Internet connection (DSL, cable) and a typical DSL/cable router (sold everywhere for well under $100).
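
On the disconnect-cleanup point, here is a minimal sketch of the general idea; it is a toy TCP "session" server in Python, not A-Shell's actual mechanism, and the cleanup hook is hypothetical. The point is that a dropped telnet-style session surfaces on the server as an explicit event (an empty read or a reset error) that software can catch and clean up after, whereas a peer that simply vanishes from a file share leaves stale locks behind (the phantom jobs and stuck QFLOCK.SYS described above).

    import socket
    import threading

    def release_session_locks(addr) -> None:
        # Hypothetical cleanup: a real server would release file locks,
        # delete temp files, and retire the job slot here.
        print(f"cleaned up session {addr}")

    def serve_session(conn: socket.socket, addr) -> None:
        """Handle one 'terminal' session and clean up however it ends."""
        try:
            while True:
                data = conn.recv(1024)
                if not data:        # orderly close by the client
                    break
                conn.sendall(data)  # echo, standing in for screen I/O
        except (ConnectionResetError, TimeoutError):
            pass                    # abrupt drop: still detected here
        finally:
            release_session_locks(addr)
            conn.close()

    def main() -> None:
        with socket.socket() as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("", 2323))    # arbitrary test port
            srv.listen()
            while True:
                conn, addr = srv.accept()
                threading.Thread(target=serve_session,
                                 args=(conn, addr), daemon=True).start()

    if __name__ == "__main__":
        main()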
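
And on the network-load point, the arithmetic behind the report-program example is worth seeing in black and white. The figures below are the ones from the text (5MB of file data, a few hundred bytes of screen output), not measurements:

    # Bytes crossing the LAN for the report-program example above.
    FILE_DATA = 5 * 1024 * 1024   # file data the report reads
    SCREEN_IO = 500               # bytes actually drawn on the screen

    peer_to_peer = FILE_DATA + SCREEN_IO  # file blocks travel over the wire
    telnet_load  = SCREEN_IO              # files are read locally on the server

    print(f"peer-to-peer: {peer_to_peer:,} bytes on the LAN")
    print(f"telnet:       {telnet_load:,} bytes on the LAN")
    print(f"reduction:    ~{peer_to_peer / telnet_load:,.0f}x")  # roughly 10,000x

A four-orders-of-magnitude reduction in wire traffic is why a flaky LAN that is unusable under file sharing can still feel responsive over telnet.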